Geometric Deep Learning
Grids, Groups, Graphs,
Geodesics, and Gauges

Michael M. Bronstein (Imperial College London / USI IDSIA / Twitter), Joan Bruna (New York University), Taco Cohen (Qualcomm AI Research; Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.), Petar Veličković (DeepMind)

Preface

For nearly two millennia since Euclid's Elements, the word 'geometry' was synonymous with Euclidean geometry, as no other types of geometry existed. Euclid's monopoly came to an end in the nineteenth century, with examples of non-Euclidean geometries constructed by Lobachevsky, Bolyai, Gauss, and Riemann. Towards the end of that century, these studies had diverged into disparate fields, with mathematicians and philosophers debating the validity of and relations between these geometries, as well as the nature of the "one true geometry".

(A historical aside: according to a popular belief, the Erlangen Programme was delivered in Klein's inaugural address in October 1872. Klein indeed gave such a talk (though on December 7 of the same year), but it was for a non-mathematical audience and concerned primarily his ideas of mathematical education. What is now called the 'Erlangen Programme' was actually a research prospectus brochure, Vergleichende Betrachtungen über neuere geometrische Forschungen ("A comparative review of recent researches in geometry"), that he prepared as part of his professor appointment; see Tobies (2019).)

A way out of this pickle was shown by a young mathematician, Felix Klein, appointed in 1872 as professor at the small Bavarian University of Erlangen. In a research prospectus, which entered the annals of mathematics as the Erlangen Programme, Klein proposed approaching geometry as the study of invariants, i.e. properties unchanged under some class of transformations, called the symmetries of the geometry. This approach created clarity by showing that the various geometries known at the time could be defined by an appropriate choice of symmetry transformations, formalised using the language of group theory. For instance, Euclidean geometry is concerned with lengths and angles, because these properties are preserved by the group of Euclidean transformations (rotations and translations), while affine geometry studies parallelism, which is preserved by the group of affine transformations. The relation between these geometries is immediately apparent when considering the respective groups, because the Euclidean group is a subgroup of the affine group, which in turn is a subgroup of the group of projective transformations.

The impact of the Erlangen Programme on geometry was profound. Furthermore, it spilled over into other fields, especially physics, where symmetry principles made it possible to derive conservation laws from first principles of symmetry (an astonishing result known as Noether's Theorem), and even enabled the classification of elementary particles as irreducible representations of the symmetry group. Category theory, now pervasive in pure mathematics, can be "regarded as a continuation of the Klein Erlangen Programme, in the sense that a geometrical space with its group of transformations is generalized to a category with its algebra of mappings", in the words of its creators Samuel Eilenberg and Saunders Mac Lane (see Marquis (2009)).

At the time of writing, the state of the field of deep learning is somewhat reminiscent of the field of geometry in the nineteenth century. There is a veritable zoo of neural network architectures for various kinds of data, but few unifying principles. As in times past, this makes it difficult to understand the relations between various methods, inevitably resulting in the reinvention and re-branding of the same concepts in different application domains. For a novice trying to learn the field, absorbing the sheer volume of redundant ideas is a true nightmare.

In this text, we make a modest attempt to apply the Erlangen Programme mindset to the domain of deep learning, with the ultimate goal of obtaining a systematisation of this field and ‘connecting the dots’. We call this geometrisation attempt ‘Geometric Deep Learning’, and true to the spirit of Felix Klein, propose to derive different inductive biases and network architectures implementing them from first principles of symmetry and invariance. In particular, we focus on a large class of neural networks designed for analysing unstructured sets, grids, graphs, and manifolds, and show that they can be understood in a unified manner as methods that respect the structure and symmetries of these domains.

We believe this text would appeal to a broad audience of deep learning researchers, practitioners, and enthusiasts. A novice may use it as an overview and introduction to Geometric Deep Learning. A seasoned deep learning expert may discover new ways of deriving familiar architectures from basic principles and perhaps some surprising connections. Practitioners may get new insights on how to solve problems in their respective fields.

With such a fast-paced field as modern machine learning, the risk of writing a text like this is that it becomes obsolete and irrelevant before it sees the light of day. Having focused on foundations, our hope is that the key concepts we discuss will transcend their specific realisations; or, as Claude Adrien Helvétius put it, "la connaissance de certains principes supplée facilement à la connoissance de certains faits" ("the knowledge of certain principles easily compensates the lack of knowledge of certain facts"; Helvétius, 1759).

Notation

$\Omega, u$ : Domain, point on the domain

$x(u) \in \mathcal{X}(\Omega, \mathcal{C})$ : Signal on the domain, of the form $x : \Omega \rightarrow \mathcal{C}$

$f(x) \in \mathcal{F}(\mathcal{X}(\Omega))$ : Function on signals on the domain, of the form $f : \mathcal{X}(\Omega) \rightarrow \mathcal{Y}$

$\mathfrak{G}, \mathfrak{g}$ : Group, element of the group

$\mathfrak{g}.u,\ \rho(\mathfrak{g})$ : Group action, group representation

$\mathbf{X} \in \mathcal{C}^{|\Omega| \times s}$ : Matrix representing a signal on a discrete domain

$\mathbf{x}_u \in \mathcal{C}^{s}$ : Vector representing a discrete domain signal $\mathbf{X}$ on element $u \in \Omega$

$x_{uj} \in \mathcal{C}$ : Scalar representing the $j$th component of a discrete domain signal $\mathbf{X}$ on element $u \in \Omega$

$\mathbf{F}(\mathbf{X})$ : Function on discrete domain signals that returns another discrete domain signal, as a matrix

$\tau : \Omega \rightarrow \Omega$ : Automorphism of the domain

$\eta : \Omega \rightarrow \Omega'$ : Isomorphism between two different domains

$\sigma : \mathcal{C} \rightarrow \mathcal{C}'$ : Activation function (point-wise non-linearity)

$G = (\mathcal{V}, \mathcal{E})$ : Graph with nodes $\mathcal{V}$ and edges $\mathcal{E}$

$\mathcal{T} = (\mathcal{V}, \mathcal{E}, \mathcal{F})$ : Mesh with nodes $\mathcal{V}$, edges $\mathcal{E}$, and faces $\mathcal{F}$

$x \star \theta$ : Convolution with filter $\theta$

$S_v$ : Shift operator

$\varphi_i$ : Basis function

$T_u\Omega,\ T\Omega$ : Tangent space at $u$, tangent bundle

$X \in T_u\Omega$ : Tangent vector

$g_u(X, Y) = \langle X, Y \rangle_u$ : Riemannian metric

$\ell(\gamma),\ \ell_{uv}$ : Length of a curve $\gamma$, discrete metric on edge $(u, v)$

1 Introduction

The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to be beyond reach – such as computer vision, playing Go, or protein folding – are in fact feasible with appropriate computational scale. Remarkably, the essence of deep learning is built from two simple algorithmic principles: first, the notion of representation or feature learning, whereby adapted, often hierarchical, features capture the appropriate notion of regularity for each task, and second, learning by local gradient-descent, typically implemented as backpropagation.

While learning generic functions in high dimensions is a cursed estimation problem, most tasks of interest are not generic, and come with essential pre-defined regularities arising from the underlying low-dimensionality and structure of the physical world. This text is concerned with exposing these regularities through unified geometric principles that can be applied throughout a wide spectrum of applications.

Exploiting the known symmetries of a large system is a powerful and classical remedy against the curse of dimensionality, and forms the basis of most physical theories. Deep learning systems are no exception, and since the early days researchers have adapted neural networks to exploit the low-dimensional geometry arising from physical measurements, e.g. grids in images, sequences in time-series, or position and momentum in molecules, and their associated symmetries, such as translation or rotation. Throughout our exposition, we will describe these models, as well as many others, as natural instances of the same underlying principle of geometric regularity.

Such a ‘geometric unification’ endeavour in the spirit of the Erlangen Programme serves a dual purpose: on one hand, it provides a common mathematical framework to study the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. On the other, it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provides a principled way to build future architectures yet to be invented.

Before proceeding, it is worth noting that our work concerns representation learning architectures and exploiting the symmetries of data therein. The many exciting pipelines where such representations may be used (such as self-supervised learning, generative modelling, or reinforcement learning) are not our central focus; the same applies to techniques used for optimising or regularising our architectures, such as Adam (Kingma and Ba, 2014), dropout (Srivastava et al., 2014), or batch normalisation (Ioffe and Szegedy, 2015). Hence, we will not review in depth influential neural pipelines such as variational autoencoders (Kingma and Welling, 2013), generative adversarial networks (Goodfellow et al., 2014), normalising flows (Rezende and Mohamed, 2015), deep Q-networks (Mnih et al., 2015), proximal policy optimisation (Schulman et al., 2017), or deep mutual information maximisation (Hjelm et al., 2019). That being said, we believe that the principles we will focus on are of significant importance in all of these areas.

Further, while we have attempted to cast a reasonably wide net in order to illustrate the power of our geometric blueprint, our work does not attempt to accurately summarise the entire existing wealth of research on Geometric Deep Learning. Rather, we study several well-known architectures in-depth in order to demonstrate the principles and ground them in existing research, with the hope that we have left sufficient references for the reader to meaningfully apply these principles to any future geometric deep architecture they encounter or devise.

2 Learning in High Dimensions

Supervised machine learning, in its simplest formalisation, considers a set of $N$ observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ drawn i.i.d. from an underlying data distribution $P$ defined over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are respectively the data and the label domains. The defining feature of this setup is that $\mathcal{X}$ is a high-dimensional space: one typically assumes $\mathcal{X} = \mathbb{R}^d$ to be a Euclidean space of large dimension $d$.

Let us further assume that the labels $y$ are generated by an unknown function $f$, such that $y_i = f(x_i)$, so that the learning problem reduces to estimating the function $f$ using a parametrised function class $\mathcal{F} = \{f_{\bm{\theta}} : \bm{\theta} \in \Theta\}$. Neural networks are a common realisation of such parametric function classes, in which case $\bm{\theta} \in \Theta$ corresponds to the network weights. In this idealised setup there is no noise in the labels, and modern deep learning systems typically operate in the so-called interpolating regime, where the estimated $\tilde{f} \in \mathcal{F}$ satisfies $\tilde{f}(x_i) = f(x_i)$ for all $i = 1, \ldots, N$. The performance of a learning algorithm is measured in terms of the expected performance on new samples drawn from $P$ (statistical learning theory is concerned with more refined notions of generalisation based on concentration inequalities; we will review some of these in future work), using some loss $L(\cdot, \cdot)$:

$$\mathcal{R}(\tilde{f}) := \mathbb{E}_{P}\, L(\tilde{f}(x), f(x)),$$

with the squared loss $L(y, y') = \tfrac{1}{2}|y - y'|^2$ being among the most commonly used.
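As a minimal numerical sketch of this setup (our own construction; the target, the estimate, and the data distribution are all hypothetical choices for illustration), the expected risk under the squared loss can be estimated by Monte Carlo on fresh samples from $P$:

```python
# Monte Carlo estimate of the risk R(f~) = E_P L(f~(x), f(x)) under the
# squared loss, for an illustrative target f and estimate f~ (both assumed).
import numpy as np

rng = np.random.default_rng(0)

f = np.sin                               # "unknown" target, for illustration
f_tilde = lambda x: x - x**3 / 6         # our estimate: a Taylor sketch of sin

def squared_loss(y, y_prime):
    return 0.5 * (y - y_prime) ** 2      # L(y, y') = |y - y'|^2 / 2

x_new = rng.uniform(0.0, 1.0, size=100_000)   # fresh samples, P = U[0, 1]
risk = squared_loss(f_tilde(x_new), f(x_new)).mean()
```

Here the risk is small because the cubic Taylor polynomial tracks the sine closely on $[0, 1]$; a poorer estimate would be penalised accordingly.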

A successful learning scheme thus needs to encode the appropriate notion of regularity or inductive bias for $f$, imposed through the construction of the function class $\mathcal{F}$ and the use of regularisation. We briefly introduce this concept in the following section.

2.1 Inductive Bias via Function Regularity

Modern machine learning operates with large, high-quality datasets, which, together with appropriate computational resources, motivate the design of rich function classes $\mathcal{F}$ with the capacity to interpolate such large data. This mindset plays well with neural networks, since even the simplest choices of architecture yield a dense class of functions. (A set $\mathcal{A} \subset \mathcal{X}$ is said to be dense in $\mathcal{X}$ if its closure $\mathcal{A} \cup \{\lim_{i \to \infty} a_i : a_i \in \mathcal{A}\} = \mathcal{X}$; this implies that any point in $\mathcal{X}$ is arbitrarily close to a point in $\mathcal{A}$.) A typical Universal Approximation result shows that the class of functions represented, e.g., by a two-layer perceptron $f(\mathbf{x}) = \mathbf{c}^\top \mathrm{sign}(\mathbf{A}\mathbf{x} + \mathbf{b})$ is dense in the space of continuous functions on $\mathbb{R}^d$. The capacity to approximate almost arbitrary functions is the subject of various Universal Approximation Theorems; several such results were proved and popularised in the 1990s by applied mathematicians and computer scientists (see e.g. Cybenko (1989); Hornik (1991); Barron (1993); Leshno et al. (1993); Maiorov (1999); Pinkus (1999)).

Figure 1: Multilayer Perceptrons (Rosenblatt, 1958), the simplest feed-forward neural networks, are universal approximators: with just one hidden layer, they can represent combinations of step functions, allowing any continuous function to be approximated with arbitrary precision.

Universal Approximation, however, does not imply an absence of inductive bias. Given a hypothesis space $\mathcal{F}$ with universal approximation, we can define a complexity measure $c : \mathcal{F} \to \mathbb{R}_+$ and redefine our interpolation problem as

$$\tilde{f} \in \arg\min_{g \in \mathcal{F}} c(g) \quad \mathrm{s.t.} \quad g(x_i) = f(x_i) \quad \mathrm{for}\;\; i = 1, \ldots, N,$$

i.e., we are looking for the most regular functions within our hypothesis class. For standard function spaces, this complexity measure can be defined as a norm (informally, a norm $\|x\|$ can be regarded as the "length" of a vector $x$; a Banach space is a complete vector space equipped with a norm), making $\mathcal{F}$ a Banach space and allowing us to leverage a plethora of theoretical results in functional analysis. In low dimensions, splines are a workhorse for function approximation. They can be formulated as above, with a norm capturing the classical notion of smoothness, such as the squared norm of second derivatives $\int_{-\infty}^{+\infty} |f''(x)|^2\, \mathrm{d}x$ for cubic splines.

In the case of neural networks, the complexity measure $c$ can be expressed in terms of the network weights, i.e. $c(f_{\bm{\theta}}) = c(\bm{\theta})$. The $L_2$-norm of the network weights, known as weight decay, or the so-called path-norm (Neyshabur et al., 2015), are popular choices in the deep learning literature. From a Bayesian perspective, such complexity measures can also be interpreted as the negative log of the prior for the function of interest. More generally, this complexity can be enforced explicitly, by incorporating it into the empirical loss (resulting in so-called Structural Risk Minimisation), or implicitly, as a result of a certain optimisation scheme. For example, it is well known that gradient descent on an under-determined least-squares objective will choose the interpolating solution with minimal $L_2$-norm. The extension of such implicit regularisation results to modern neural networks is the subject of current studies (see e.g. Blanc et al. (2020); Shamir and Vardi (2020); Razin and Cohen (2020); Gunasekar et al. (2017)). All in all, a natural question arises: how do we define effective priors that capture the expected regularities and complexities of real-world prediction tasks?
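The minimum-norm behaviour of under-determined least squares can be checked directly. The following sketch (our own, with hypothetical dimensions) relies on the documented property that `np.linalg.lstsq` returns the minimum-$L_2$-norm solution for rank-deficient or under-determined systems, and compares it to another interpolant obtained by adding a null-space direction:

```python
# Implicit regularisation illustration: among all interpolating solutions of
# an under-determined linear system, the lstsq solution has minimal L2 norm.
import numpy as np

rng = np.random.default_rng(1)
N, d = 10, 50                        # fewer observations than dimensions
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

w_min, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm interpolant

# build another interpolant: add a component from the null space of X
v = rng.standard_normal(d)
v -= X.T @ np.linalg.solve(X @ X.T, X @ v)      # project v onto null(X)
w_other = w_min + v

# both interpolate the data, but w_min is strictly shorter
interpolates = np.allclose(X @ w_min, y) and np.allclose(X @ w_other, y)
shorter = np.linalg.norm(w_min) < np.linalg.norm(w_other)
```

Because `w_min` lies in the row space of `X`, which is orthogonal to the null space, adding any nonzero null-space component can only increase the norm; this is exactly the solution gradient descent initialised at zero converges to.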

2.2 The Curse of Dimensionality

While interpolation in low dimensions (with $d = 1, 2$ or $3$) is a classic signal processing task with very precise mathematical control of estimation errors using increasingly sophisticated regularity classes (such as spline interpolants, wavelets, curvelets, or ridgelets), the situation for high-dimensional problems is entirely different.

In order to convey the essence of the idea, let us consider a classical notion of regularity that can be easily extended to high dimensions: 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$, i.e. functions satisfying $|f(x) - f(x')| \leq \|x - x'\|$ for all $x, x' \in \mathcal{X}$. This hypothesis only asks the target function to be locally smooth: if we perturb the input $x$ slightly (as measured by the norm $\|x - x'\|$), the output $f(x)$ is not allowed to change much. If our only knowledge of the target function $f$ is that it is 1-Lipschitz, how many observations do we expect to require to ensure that our estimate $\tilde{f}$ will be close to $f$? Figure 2 reveals that the general answer is necessarily exponential in the dimension $d$, signalling that the Lipschitz class grows 'too quickly' as the input dimension increases: in many applications with even modest dimension $d$, the number of samples would exceed the number of atoms in the universe.

The situation is no better if one replaces the Lipschitz class by a global smoothness hypothesis, such as the Sobolev class $\mathcal{H}^s(\Omega_d)$. (A function $f$ is in the Sobolev class $\mathcal{H}^s(\Omega_d)$ if $f \in L^2(\Omega_d)$ and the generalised $s$-th order derivative is square-integrable: $\int |\omega|^{2s+1} |\hat{f}(\omega)|^2\, \mathrm{d}\omega < \infty$, where $\hat{f}$ is the Fourier transform of $f$; see Section 4.2.) Indeed, classic results (Tsybakov, 2008) establish a minimax rate of approximation and learning for the Sobolev class of the order $\epsilon^{-d/s}$, showing that the extra smoothness assumptions on $f$ only improve the statistical picture when $s \propto d$, an unrealistic assumption in practice.

Figure 2: We consider a Lipschitz function $f(x) = \sum_{j=1}^{2^d} z_j \phi(x - x_j)$, where $z_j = \pm 1$, $x_j \in \mathbb{R}^d$ is placed in each quadrant, and $\phi$ is a locally supported Lipschitz 'bump'. Unless we observe the function in most of the $2^d$ quadrants, we will incur a constant error in predicting it.
This simple geometric argument can be formalised through the notion of Maximum Discrepancy (von Luxburg and Bousquet, 2004), defined for the Lipschitz class as $\kappa(d) = \mathbb{E}_{x, x'} \sup_{f \in \mathrm{Lip}(1)} \left| \frac{1}{N} \sum_l f(x_l) - \frac{1}{N} \sum_l f(x'_l) \right| \simeq N^{-1/d}$, which measures the largest expected discrepancy between two independent $N$-sample expectations. Ensuring that $\kappa(d) \simeq \epsilon$ requires $N = \Theta(\epsilon^{-d})$; the corresponding sample $\{x_l\}_l$ defines an $\epsilon$-net of the domain. For a $d$-dimensional Euclidean domain of diameter $1$, its size grows exponentially as $\epsilon^{-d}$.
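The exponential blow-up behind Figure 2 is easy to see numerically. In this small simulation (our own construction, not from the text), a fixed budget of $N$ random samples visits all sign-quadrants of $[-1, 1]^d$ in low dimension, but only a vanishing fraction of the $2^d$ quadrants once $d$ is even moderately large:

```python
# Fraction of the 2^d quadrants of [-1, 1]^d observed by N uniform samples:
# with N fixed, coverage collapses exponentially as d grows.
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                    # fixed sample budget

def quadrant_coverage(d: int) -> float:
    """Fraction of the 2^d sign-quadrants hit by N uniform samples."""
    signs = rng.uniform(-1.0, 1.0, size=(N, d)) > 0
    hit = {tuple(row) for row in signs}     # quadrant = tuple of sign bits
    return len(hit) / 2.0 ** d

coverage = {d: quadrant_coverage(d) for d in (2, 5, 10, 20)}
```

At $d = 20$ the coverage is at most $1000 / 2^{20} \approx 0.001$ regardless of where the samples fall, so a Lipschitz 'bump' function as in Figure 2 remains unobserved in essentially every quadrant.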

Fully-connected neural networks define function spaces that enable more flexible notions of regularity, obtained by considering complexity functions $c$ on their weights. In particular, by choosing a sparsity-promoting regularisation, they have the ability to break this curse of dimensionality (Bach, 2017). However, this comes at the expense of making strong assumptions on the nature of the target function $f$, such as that $f$ depends on a collection of low-dimensional projections of the input (see Figure 3). In most real-world applications (such as computer vision, speech analysis, physics, or chemistry), functions of interest tend to exhibit complex long-range correlations that cannot be expressed with low-dimensional projections, making this hypothesis unrealistic. It is thus necessary to define an alternative source of regularity, by exploiting the spatial structure of the physical domain and the geometric priors of $f$, as we describe in Section 3.

Figure 3: If the unknown function $f$ is presumed to be well approximated as $f(\mathbf{x}) \approx g(\mathbf{A}\mathbf{x})$ for some unknown $\mathbf{A} \in \mathbb{R}^{k \times d}$ with $k \ll d$, then shallow neural networks can capture this inductive bias; see e.g. Bach (2017). In typical applications, such dependency on low-dimensional projections is unrealistic, as illustrated in this example: a low-pass filter projects the input images to a low-dimensional subspace; while it conveys most of the semantics, substantial information is lost.

3 Geometric Priors

Modern data analysis is synonymous with high-dimensional learning. While the simple arguments of Section 2.1 reveal the impossibility of learning from generic high-dimensional data as a result of the curse of dimensionality, there is hope for physically-structured data, where we can employ two fundamental principles: symmetry and scale separation. In the settings considered in this text, this additional structure will usually come from the structure of the domain underlying the input signals: we will assume that our machine learning system operates on signals (functions) on some domain $\Omega$. While in many cases a linear combination of points on $\Omega$ is not well-defined ($\Omega$ must be a vector space for an expression $\alpha u + \beta v$ to make sense), we can linearly combine signals on it, i.e., the space of signals forms a vector space. Moreover, since we can define an inner product between signals, this space is a Hilbert space.

The space of $\mathcal{C}$-valued signals on $\Omega$\footnote{When $\Omega$ has some additional structure, we may further restrict the kinds of signals in $\mathcal{X}(\Omega,\mathcal{C})$. For example, when $\Omega$ is a smooth manifold, we may require the signals to be smooth. Whenever possible, we will omit the range $\mathcal{C}$ for brevity.} (for $\Omega$ a set, possibly with additional structure, and $\mathcal{C}$ a vector space, whose dimensions are called channels)
\[ \mathcal{X}(\Omega,\mathcal{C}) = \{ x : \Omega \rightarrow \mathcal{C} \} \tag{1} \]
is a function space that has a vector space structure. Addition and scalar multiplication of signals are defined as
\[ (\alpha x + \beta y)(u) = \alpha x(u) + \beta y(u) \quad \text{for all } u \in \Omega, \]
with real scalars $\alpha, \beta$. Given an inner product $\langle v, w \rangle_{\mathcal{C}}$ on $\mathcal{C}$ and a measure $\mu$ on $\Omega$ (with respect to which we can define an integral)\footnote{When the domain $\Omega$ is discrete, $\mu$ can be chosen as the counting measure, in which case the integral becomes a sum. In the following, we will omit the measure and use $\mathrm{d}u$ for brevity.}, we can define an inner product on $\mathcal{X}(\Omega,\mathcal{C})$ as
\[ \langle x, y \rangle = \int_{\Omega} \langle x(u), y(u) \rangle_{\mathcal{C}} \, \mathrm{d}\mu(u). \tag{2} \]
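On a discrete domain, the counting measure turns the integral in (2) into a sum over the points of $\Omega$. A minimal numerical sketch of this signal space (variable names are illustrative, not from the text):

```python
import numpy as np

# Discrete domain with n points; C = R^3 (three channels),
# so a signal x: Omega -> C is stored as an (n, 3) array.
n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((n, 3))
y = rng.standard_normal((n, 3))

def inner(x, y):
    # <x, y> = sum_u <x(u), y(u)>_C  (counting measure on a discrete Omega)
    return float(np.sum(x * y))

# The space of signals is a vector space: signals combine pointwise.
z = 2.0 * x + 3.0 * y          # (alpha x + beta y)(u) = alpha x(u) + beta y(u)

# Bilinearity of the inner product follows from linearity of the sum:
lhs = inner(z, y)
rhs = 2.0 * inner(x, y) + 3.0 * inner(y, y)
assert np.isclose(lhs, rhs)
```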

As a typical illustration, take $\Omega = \mathbb{Z}_n \times \mathbb{Z}_n$ to be a two-dimensional $n \times n$ grid, $x$ an RGB image (i.e. a signal $x : \Omega \rightarrow \mathbb{R}^3$), and $f$ a function (such as a single-layer Perceptron) operating on $3n^2$-dimensional inputs. As we will see in greater detail in the following, the domain $\Omega$ is usually endowed with certain geometric structure and symmetries. Scale separation results from our ability to preserve important characteristics of the signal when transferring it onto a coarser version of the domain (in our example, subsampling the image by coarsening the underlying grid).
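Concretely, an RGB image on the $n \times n$ grid is an array of shape $(n, n, 3)$, which a Perceptron consumes as a flat $3n^2$-dimensional vector, while coarsening the grid gives the subsampled image. A shapes-only sketch:

```python
import numpy as np

n = 4
# A signal x: Z_n x Z_n -> R^3, stored as an (n, n, 3) array.
x = np.arange(n * n * 3, dtype=float).reshape(n, n, 3)

# A single-layer Perceptron ignores the grid structure and acts
# on the flattened 3n^2-dimensional input.
v = x.reshape(-1)
assert v.shape == (3 * n * n,)

# Coarsening the grid (scale separation): subsample by a factor of 2.
x_coarse = x[::2, ::2, :]
assert x_coarse.shape == (n // 2, n // 2, 3)
```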

We will show that both principles, to which we will generically refer as geometric priors, are prominent in most modern deep learning architectures. In the case of images considered above, geometric priors are built into Convolutional Neural Networks (CNNs) in the form of convolutional filters with shared weights (exploiting translational symmetry) and pooling (exploiting scale separation). Extending these ideas to other domains such as graphs and manifolds and showing how geometric priors emerge from fundamental principles is the main goal of Geometric Deep Learning and the leitmotif of our text.

3.1 Symmetries, Representations, and Invariance

Informally, a symmetry of an object or system is a transformation that leaves a certain property of said object or system unchanged or invariant. Such transformations may be continuous, smooth, or discrete. Symmetries are ubiquitous in many machine learning tasks. For example, in computer vision the object category is unchanged by shifts, so shifts are symmetries in the problem of visual object classification. In computational chemistry, the task of predicting properties of molecules independently of their orientation in space requires rotational invariance. Discrete symmetries emerge naturally when describing particle systems where particles do not have a canonical ordering and thus can be arbitrarily permuted, as well as in many dynamical systems via time-reversal symmetry (such as systems in detailed balance, or Newton's second law of motion). As we will see in Section 4.1, permutation symmetries are also central to the analysis of graph-structured data.

Symmetry groups

The set of symmetries of an object satisfies a number of properties. First, symmetries may be combined to obtain new symmetries: if $\mathfrak{g}$ and $\mathfrak{h}$ are two symmetries, then their compositions $\mathfrak{g} \circ \mathfrak{h}$ and $\mathfrak{h} \circ \mathfrak{g}$ are also symmetries.\footnote{We will follow the juxtaposition notation convention used in group theory, $\mathfrak{g} \circ \mathfrak{h} = \mathfrak{g}\mathfrak{h}$, which should be read right-to-left: we first apply $\mathfrak{h}$ and then $\mathfrak{g}$. The order is important, as in many cases symmetries are non-commutative. Readers familiar with Lie groups might be disturbed by our choice of the Fraktur font to denote group elements, as it is a common notation for Lie algebras.} The reason is that if both transformations leave the object invariant, then so does their composition, and hence the composition is also a symmetry. Furthermore, symmetries are always invertible, and the inverse is also a symmetry. This shows that the collection of all symmetries forms an algebraic object known as a group. Since these objects will be a centerpiece of the mathematical model of Geometric Deep Learning, they deserve a formal definition and detailed discussion:

A group is a set $\mathfrak{G}$ along with a binary operation $\circ : \mathfrak{G} \times \mathfrak{G} \rightarrow \mathfrak{G}$ called composition (for brevity, denoted by juxtaposition $\mathfrak{g} \circ \mathfrak{h} = \mathfrak{g}\mathfrak{h}$) satisfying the following axioms:

Associativity: $(\mathfrak{g}\mathfrak{h})\mathfrak{k} = \mathfrak{g}(\mathfrak{h}\mathfrak{k})$ for all $\mathfrak{g}, \mathfrak{h}, \mathfrak{k} \in \mathfrak{G}$.

Identity: there exists a unique $\mathfrak{e} \in \mathfrak{G}$ satisfying $\mathfrak{e}\mathfrak{g} = \mathfrak{g}\mathfrak{e} = \mathfrak{g}$ for all $\mathfrak{g} \in \mathfrak{G}$.

Inverse: for each $\mathfrak{g} \in \mathfrak{G}$ there is a unique inverse $\mathfrak{g}^{-1} \in \mathfrak{G}$ such that $\mathfrak{g}\mathfrak{g}^{-1} = \mathfrak{g}^{-1}\mathfrak{g} = \mathfrak{e}$.

Closure: the group is closed under composition, i.e., for every $\mathfrak{g}, \mathfrak{h} \in \mathfrak{G}$, we have $\mathfrak{g}\mathfrak{h} \in \mathfrak{G}$.
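For a small finite group, the axioms can be checked mechanically by exhaustive enumeration. A sketch for the cyclic group $\mathbb{Z}_4$, with composition given by addition modulo 4:

```python
from itertools import product

G = [0, 1, 2, 3]                       # the cyclic group Z_4
op = lambda g, h: (g + h) % 4          # composition

# Closure: every composition lands back in G.
assert all(op(g, h) in G for g, h in product(G, G))
# Associativity: (g h) k == g (h k).
assert all(op(op(g, h), k) == op(g, op(h, k)) for g, h, k in product(G, G, G))
# Identity: e = 0 leaves every element unchanged.
e = 0
assert all(op(e, g) == g and op(g, e) == g for g in G)
# Inverses: g^{-1} = -g mod 4.
inv = {g: (-g) % 4 for g in G}
assert all(op(g, inv[g]) == e and op(inv[g], g) == e for g in G)
# This particular group also happens to be Abelian:
assert all(op(g, h) == op(h, g) for g, h in product(G, G))
```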

Note that commutativity is not part of this definition, i.e. we may have $\mathfrak{g}\mathfrak{h} \neq \mathfrak{h}\mathfrak{g}$. Groups for which $\mathfrak{g}\mathfrak{h} = \mathfrak{h}\mathfrak{g}$ for all $\mathfrak{g}, \mathfrak{h} \in \mathfrak{G}$ are called commutative or Abelian.\footnote{After the Norwegian mathematician Niels Henrik Abel (1802–1829).}

Though some groups can be very large and even infinite, they often arise from compositions of just a few elements, called group generators. Formally, $\mathfrak{G}$ is said to be generated by a subset $S \subseteq \mathfrak{G}$ (called the group generator) if every element $\mathfrak{g} \in \mathfrak{G}$ can be written as a finite composition of the elements of $S$ and their inverses. For instance, the symmetry group of an equilateral triangle (the dihedral group $\mathrm{D}_3$) is generated by a $120^\circ$ rotation and a reflection (Figure 4). The 1D translation group, which we will discuss in detail in the following, is generated by infinitesimal displacements; this is an example of a Lie group of differentiable symmetries.\footnote{Lie groups have a differentiable manifold structure. One such example that we will study in Section 4.3 is the special orthogonal group $\mathrm{SO}(3)$, which is a 3-dimensional manifold.}

Note that here we have defined a group as an abstract object, without saying what the group elements are (e.g. transformations of some domain), only how they compose. Hence, very different kinds of objects may have the same symmetry group. For instance, the aforementioned group of rotation and reflection symmetries of a triangle is the same as the group of permutations of a sequence of three elements (we can permute the corners of the triangle in any way using rotations and reflections, see Figure 4).\footnote{The diagram shown in Figure 4 (where each node is associated with a group element, and each arrow with a generator) is known as the Cayley diagram.}
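We can generate $\mathrm{D}_3$ explicitly by closing a rotation and a reflection, represented as permutations of the corner labels $\{0, 1, 2\}$, under composition, and confirm that it coincides with all $3! = 6$ permutations. A sketch:

```python
from itertools import permutations

# Generators of D3 as permutations of the triangle's corners {0, 1, 2}:
r = (1, 2, 0)   # 120-degree rotation: sends corner i to r[i]
f = (0, 2, 1)   # reflection fixing corner 0

def compose(g, h):
    # (g o h)(i) = g(h(i)): apply h first, then g (right-to-left).
    return tuple(g[h[i]] for i in range(3))

# Close the generating set under composition; in a finite group,
# inverses are reached automatically as powers of the generators.
group = {(0, 1, 2), r, f}
while True:
    new = {compose(g, h) for g in group for h in group} - group
    if not new:
        break
    group |= new

assert len(group) == 6
assert group == set(permutations(range(3)))   # D3 coincides with Sigma_3
```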


Figure 4: Left: an equilateral triangle with corners labelled $1, 2, 3$, and all possible rotations and reflections of the triangle. The group $\mathrm{D}_3$ of rotation/reflection symmetries of the triangle is generated by only two elements (a rotation R by $120^\circ$ and a reflection F) and is the same as the group $\Sigma_3$ of permutations of three elements. Right: the multiplication table of the group $\mathrm{D}_3$. The entry in row $\mathfrak{g}$ and column $\mathfrak{h}$ corresponds to the element $\mathfrak{g}\mathfrak{h}$.

Group Actions and Group Representations

Rather than considering groups as abstract entities, we are mostly interested in how groups act on data. Since we assumed that there is some domain $\Omega$ underlying our data, we will study how the group acts on $\Omega$ (e.g. translations of points of the plane), and from there obtain actions of the same group on the space of signals $\mathcal{X}(\Omega)$ (e.g. translations of planar images and feature maps).

A group action\footnote{Technically, what we define here is a left group action.} of $\mathfrak{G}$ on a set $\Omega$ is defined as a mapping $(\mathfrak{g}, u) \mapsto \mathfrak{g}.u$ associating a group element $\mathfrak{g} \in \mathfrak{G}$ and a point $u \in \Omega$ with some other point on $\Omega$ in a way that is compatible with the group operations, i.e., $\mathfrak{g}.(\mathfrak{h}.u) = (\mathfrak{g}\mathfrak{h}).u$ for all $\mathfrak{g}, \mathfrak{h} \in \mathfrak{G}$ and $u \in \Omega$. We shall see numerous instances of group actions in the following sections. For example, in the plane the Euclidean group $\mathrm{E}(2)$ is the group of transformations of $\mathbb{R}^2$ that preserve Euclidean distances\footnote{Distance-preserving transformations are called isometries. According to Klein's Erlangen Programme, classical Euclidean geometry arises from this group.}, and consists of translations, rotations, and reflections. The same group, however, can also act on the space of images on the plane (by translating, rotating and flipping the grid of pixels), as well as on the representation spaces learned by a neural network. More precisely, if we have a group $\mathfrak{G}$ acting on $\Omega$, we automatically obtain an action of $\mathfrak{G}$ on the space $\mathcal{X}(\Omega)$:

\[ (\mathfrak{g}.x)(u) = x(\mathfrak{g}^{-1}u). \tag{3} \]

Due to the inverse on $\mathfrak{g}$, this is indeed a valid group action, in that we have $(\mathfrak{g}.(\mathfrak{h}.x))(u) = ((\mathfrak{g}\mathfrak{h}).x)(u)$.
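For the cyclic translation group $\mathbb{Z}_n$ acting on 1D signals, $(\mathfrak{g}.x)(u) = x(\mathfrak{g}^{-1}u) = x(u - \mathfrak{g} \bmod n)$, and the inverse is exactly what makes the compatibility condition hold. A minimal sketch:

```python
import numpy as np

n = 8
x = np.arange(n, dtype=float)   # a signal on Omega = Z_n

def act(g, x):
    # (g.x)(u) = x(g^{-1} u) = x(u - g mod n);
    # np.roll(x, g) produces exactly y[u] = x[(u - g) % n].
    return np.roll(x, g)

g, h = 3, 5
# Compatibility: g.(h.x) = (g h).x, with group operation addition mod n.
assert np.allclose(act(g, act(h, x)), act((g + h) % n, x))
```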

The most important kind of group actions, which we will encounter repeatedly throughout this text, are linear group actions, also known as group representations. The action on signals in equation (3) is indeed linear, in the sense that

\[ \mathfrak{g}.(\alpha x + \beta x') = \alpha(\mathfrak{g}.x) + \beta(\mathfrak{g}.x') \]

for any scalars $\alpha, \beta$ and signals $x, x' \in \mathcal{X}(\Omega)$. We can describe linear actions either as maps $(\mathfrak{g}, x) \mapsto \mathfrak{g}.x$ that are linear in $x$, or equivalently, by currying, as a map $\rho : \mathfrak{G} \rightarrow \mathbb{R}^{n \times n}$\footnote{When $\Omega$ is infinite, the space of signals $\mathcal{X}(\Omega)$ is infinite-dimensional, in which case $\rho(\mathfrak{g})$ is a linear operator on this space rather than a finite-dimensional matrix. In practice, one must always discretise to a finite grid, though.} that assigns to each group element $\mathfrak{g}$ an (invertible) matrix $\rho(\mathfrak{g})$. The dimension $n$ of the matrix is in general arbitrary and not necessarily related to the dimensionality of the group or the dimensionality of $\Omega$, but in applications to deep learning $n$ will usually be the dimensionality of the feature space on which the group acts. For instance, we may have the group of 2D translations acting on a space of images with $n$ pixels.

As with a general group action, the assignment of matrices to group elements should be compatible with the group composition. More specifically, the matrix representing a composite group element $\mathfrak{g}\mathfrak{h}$ should equal the matrix product of the representations of $\mathfrak{g}$ and $\mathfrak{h}$:

An $n$-dimensional real representation of a group $\mathfrak{G}$ is a map $\rho : \mathfrak{G} \rightarrow \mathbb{R}^{n \times n}$, assigning to each $\mathfrak{g} \in \mathfrak{G}$ an invertible matrix $\rho(\mathfrak{g})$ and satisfying the condition $\rho(\mathfrak{g}\mathfrak{h}) = \rho(\mathfrak{g})\rho(\mathfrak{h})$ for all $\mathfrak{g}, \mathfrak{h} \in \mathfrak{G}$.\footnote{Similarly, a complex representation is a map $\rho : \mathfrak{G} \rightarrow \mathbb{C}^{n \times n}$ satisfying the same equation.} A representation is called unitary or orthogonal if the matrix $\rho(\mathfrak{g})$ is unitary or orthogonal for all $\mathfrak{g} \in \mathfrak{G}$.

Written in the language of group representations, the action of $\mathfrak{G}$ on signals $x \in \mathcal{X}(\Omega)$ is defined as $(\rho(\mathfrak{g})x)(u) = x(\mathfrak{g}^{-1}u)$. We again verify that

\[ (\rho(\mathfrak{g})(\rho(\mathfrak{h})x))(u) = (\rho(\mathfrak{g}\mathfrak{h})x)(u). \]
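In matrix form, $\rho(\mathfrak{g})$ for a cyclic shift is a permutation matrix, and the homomorphism property $\rho(\mathfrak{g}\mathfrak{h}) = \rho(\mathfrak{g})\rho(\mathfrak{h})$ can be checked directly. A sketch (this particular representation is also orthogonal):

```python
import numpy as np

n = 6

def rho(g):
    # Permutation matrix implementing (rho(g) x)(u) = x(u - g mod n).
    P = np.zeros((n, n))
    for u in range(n):
        P[u, (u - g) % n] = 1.0
    return P

g, h = 2, 3
# Homomorphism property: rho(g h) = rho(g) rho(h).
assert np.allclose(rho((g + h) % n), rho(g) @ rho(h))
# Orthogonality: rho(g)^T rho(g) = I.
assert np.allclose(rho(g).T @ rho(g), np.eye(n))
```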
Figure 5: Three spaces of interest in Geometric Deep Learning: the (physical) domain $\Omega$, the space of signals $\mathcal{X}(\Omega)$, and the hypothesis class $\mathcal{F}(\mathcal{X}(\Omega))$. Symmetries of the domain $\Omega$ (captured by the group $\mathfrak{G}$) act on signals $x \in \mathcal{X}(\Omega)$ through group representations $\rho(\mathfrak{g})$, imposing structure on the functions $f \in \mathcal{F}(\mathcal{X}(\Omega))$ acting on such signals.

Invariant and Equivariant functions

The symmetry of the domain $\Omega$ underlying the signals $\mathcal{X}(\Omega)$ imposes structure on the functions $f$ defined on such signals. It turns out to be a powerful inductive bias, improving learning efficiency\footnote{In general, $f$ depends both on the signal and the domain, i.e., $\mathcal{F}(\mathcal{X}(\Omega), \Omega)$. We will often omit the latter dependency for brevity.} by reducing the space of possible interpolants, $\mathcal{F}(\mathcal{X}(\Omega))$, to those which satisfy the symmetry priors. Two important cases we will explore in this text are invariant and equivariant functions.

A function $f : \mathcal{X}(\Omega) \rightarrow \mathcal{Y}$ is $\mathfrak{G}$-invariant if $f(\rho(\mathfrak{g})x) = f(x)$ for all $\mathfrak{g} \in \mathfrak{G}$ and $x \in \mathcal{X}(\Omega)$, i.e., its output is unaffected by the group action on the input.

A classical example of invariance is shift-invariance\footnote{Note that signal processing books routinely use the term 'shift-invariance' to refer to shift-equivariance, e.g. Linear Shift-Invariant Systems.}, arising in computer vision and pattern recognition applications such as image classification. The function $f$ in this case (typically implemented as a Convolutional Neural Network) inputs an image and outputs the probability that the image contains an object from a certain class (e.g. cat or dog). It is often reasonable to assume that the classification result should not be affected by the position of the object in the image, i.e., that the function $f$ must be shift-invariant. Multi-layer Perceptrons, which can approximate any smooth function, do not have this property, which is one of the reasons why early attempts to apply these architectures to problems of pattern recognition in the 1970s failed. The development of neural network architectures with local weight sharing, as epitomised by Convolutional Neural Networks, was, among other reasons, motivated by the need for shift-invariant object classification.

If we take a closer look at the convolutional layers of CNNs, however, we will find that they are not shift-invariant but shift-equivariant: in other words, a shift of the input to a convolutional layer produces a shift of the output feature maps by the same amount.

A function $f : \mathcal{X}(\Omega) \rightarrow \mathcal{X}(\Omega)$ is $\mathfrak{G}$-equivariant if\footnote{More generally, we might have $f : \mathcal{X}(\Omega) \rightarrow \mathcal{X}(\Omega')$ with input and output spaces having different domains $\Omega, \Omega'$ and representations $\rho, \rho'$ of the same group $\mathfrak{G}$. In this case, equivariance is defined as $f(\rho(\mathfrak{g})x) = \rho'(\mathfrak{g})f(x)$.} $f(\rho(\mathfrak{g})x) = \rho(\mathfrak{g})f(x)$ for all $\mathfrak{g} \in \mathfrak{G}$, i.e., the group action on the input affects the output in the same way.

Resorting again to computer vision, a prototypical application requiring shift-equivariance is image segmentation, where the output of $f$ is a pixel-wise image mask. Obviously, the segmentation mask must follow shifts in the input image. In this example, the domains of the input and output are the same, but since the input has three color channels while the output has one channel per class, the representations $(\rho, \mathcal{X}(\Omega,\mathcal{C}))$ and $(\rho', \mathcal{X}(\Omega,\mathcal{C}'))$ are somewhat different.

However, even the previous use case of image classification is usually implemented as a sequence of convolutional (shift-equivariant) layers, followed by global pooling (which is shift-invariant). As we will see in Section 3.5, this is a general blueprint of a majority of deep learning architectures, including CNNs and Graph Neural Networks (GNNs).
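This blueprint can be verified numerically for 1D circular convolution and global pooling: convolution commutes with shifts (equivariance), while a global sum is unchanged by them (invariance). A minimal sketch:

```python
import numpy as np

n = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(n)          # input signal on Z_n
w = rng.standard_normal(5)          # convolutional filter (shared weights)

def shift(g, x):
    return np.roll(x, g)

def circ_conv(x, w):
    # Circular convolution via the FFT: same weights w at every position.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w, n)))

def pool(x):
    return np.sum(x)                # global (shift-invariant) pooling

g = 7
# Shift-equivariance of convolution: conv(shift(x)) == shift(conv(x)).
assert np.allclose(circ_conv(shift(g, x), w), shift(g, circ_conv(x, w)))
# Shift-invariance of the composition pooling o convolution.
assert np.isclose(pool(circ_conv(shift(g, x), w)), pool(circ_conv(x, w)))
```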

3.2 Isomorphisms and Automorphisms

Subgroups and Levels of structure

As mentioned before, a symmetry\footnote{Invertible and structure-preserving maps between different objects often go under the generic name of isomorphisms (Greek for 'equal shape'). An isomorphism from an object to itself is called an automorphism, or symmetry.} is a transformation that preserves some property or structure, and the set of all such transformations for a given structure forms a symmetry group. It often happens that there is not one but multiple structures of interest, and so we can consider several levels of structure on our domain $\Omega$. Hence, what counts as a symmetry depends on the structure under consideration, but in all cases a symmetry is an invertible map that respects this structure.

On the most basic level, the domain $\Omega$ is a set, which has a minimal amount of structure: all we can say is that the set has some cardinality\footnote{For a finite set, the cardinality is the number of elements ('size') of the set, while for infinite sets the cardinality distinguishes different kinds of infinity, such as the countable infinity of the natural numbers or the uncountable infinity of the continuum $\mathbb{R}$.}. Self-maps that preserve this structure are bijections (invertible maps), which we may consider as set-level symmetries. One can easily verify that these form a group by checking the axioms: the composition of two bijections is also a bijection (closure), associativity stems from the associativity of function composition, the map $\tau(u) = u$ is the identity element, and for every $\tau$ the inverse exists by definition, satisfying $(\tau \circ \tau^{-1})(u) = (\tau^{-1} \circ \tau)(u) = u$.

Depending on the application, there may be further levels of structure. For instance, if $\Omega$ is a topological space, we can consider maps that preserve continuity: such maps are called homeomorphisms and, in addition to being bijections between sets, are also continuous and have a continuous inverse. Intuitively, continuous functions are well-behaved and map points in a neighbourhood (open set) around a point $u$ to a neighbourhood around $\tau(u)$.

One can further demand that the map and its inverse are (continuously) differentiable\footnote{Every differentiable function is continuous. If the map is continuously differentiable 'sufficiently many times', it is said to be smooth.}, i.e., that the map and its inverse have a derivative at every point (and the derivative is also continuous). This requires the further differentiable structure that comes with differentiable manifolds, where such maps are called diffeomorphisms and denoted by $\mathrm{Diff}(\Omega)$. Additional examples of structures we will encounter include distances or metrics (maps preserving them are called isometries) and orientation (to the best of our knowledge, orientation-preserving maps do not have a common Greek name).

A metric or distance is a function $d : \Omega \times \Omega \rightarrow [0, \infty)$ satisfying, for all $u, v, w \in \Omega$:

Identity of indiscernibles: $d(u,v) = 0$ if and only if $u = v$.

Symmetry: $d(u,v) = d(v,u)$.

Triangle inequality: $d(u,v) \leq d(u,w) + d(w,v)$.

A space $(\Omega, d)$ equipped with a metric is called a metric space.

The right level of structure to consider depends on the problem. For example, when segmenting histopathology slide images, we may wish to consider flipped versions of an image as equivalent (as the sample can be flipped when put under the microscope), but if we are trying to classify road signs, we would only want to consider orientation-preserving transformations as symmetries (since reflections could change the meaning of the sign).

As we add levels of structure to be preserved, the symmetry group will get smaller. Indeed, adding structure is equivalent to selecting a subgroup, which is a subset of the larger group that satisfies the axioms of a group by itself:

Let $(\mathfrak{G}, \circ)$ be a group and $\mathfrak{H} \subseteq \mathfrak{G}$ a subset. $\mathfrak{H}$ is said to be a subgroup of $\mathfrak{G}$ if $(\mathfrak{H}, \circ)$ constitutes a group with the same operation.

For instance, the group of Euclidean isometries $\mathrm{E}(2)$ is a subgroup of the group of planar diffeomorphisms $\mathrm{Diff}(\mathbb{R}^2)$, and in turn the group of orientation-preserving isometries $\mathrm{SE}(2)$ is a subgroup of $\mathrm{E}(2)$. This hierarchy of structure follows the Erlangen Programme philosophy outlined in the Preface: in Klein's construction, the projective, affine, and Euclidean geometries have increasingly more invariants and correspond to progressively smaller groups.
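This hierarchy can be made tangible numerically: an element of $\mathrm{SE}(2)$ preserves pairwise distances and orientation, while a reflection (in $\mathrm{E}(2)$ but not $\mathrm{SE}(2)$) preserves distances but flips orientation. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.standard_normal((5, 2))                 # points in the plane

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation: in SE(2)
F = np.array([[1.0, 0.0], [0.0, -1.0]])           # reflection: in E(2) \ SE(2)
t = np.array([2.0, -1.0])                         # translation

def dists(p):
    # Matrix of pairwise Euclidean distances.
    return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)

# Both transformations are isometries: pairwise distances are preserved.
assert np.allclose(dists(pts @ R.T + t), dists(pts))
assert np.allclose(dists(pts @ F.T + t), dists(pts))

# Orientation: determinant +1 for SE(2), -1 for the reflection.
assert np.isclose(np.linalg.det(R), 1.0)
assert np.isclose(np.linalg.det(F), -1.0)
```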

Isomorphisms and Automorphisms

We have described symmetries as structure-preserving and invertible maps from an object to itself. Such maps are also known as automorphisms, and describe a way in which an object is equivalent to itself. However, an equally important class of maps are the so-called isomorphisms, which exhibit an equivalence between two non-identical objects. These concepts are often conflated, but distinguishing them is necessary to create clarity for the following discussion.

To understand the difference, consider the set $\Omega = \{0, 1, 2\}$. An automorphism of the set $\Omega$ is a bijection $\tau : \Omega \rightarrow \Omega$, such as the cyclic shift $\tau(u) = u + 1 \bmod 3$. Such a map preserves the cardinality property and maps $\Omega$ onto itself. If we have another set $\Omega' = \{a, b, c\}$ with the same number of elements, then a bijection $\eta : \Omega \rightarrow \Omega'$ such as $\eta(0) = a$, $\eta(1) = b$, $\eta(2) = c$ is a set isomorphism.

As we will see in Section 4.1 for graphs, the notion of structure includes not just the number of nodes, but also the connectivity. An isomorphism $\eta : \mathcal{V} \rightarrow \mathcal{V}'$ between two graphs $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ and $\mathcal{G}' = (\mathcal{V}', \mathcal{E}')$ is thus a bijection between the nodes that maps pairs of connected nodes to pairs of connected nodes, and likewise for pairs of non-connected nodes.\footnote{I.e., $(\eta(u), \eta(v)) \in \mathcal{E}'$ iff $(u,v) \in \mathcal{E}$.} Two isomorphic graphs are thus structurally identical, and differ only in the way their nodes are ordered. On the other hand, a graph automorphism or symmetry is a map $\tau : \mathcal{V} \rightarrow \mathcal{V}$ that maps the nodes of the graph back to itself while preserving the connectivity. A graph with a non-trivial automorphism (i.e., $\tau \neq \mathrm{id}$) possesses symmetries. The Folkman graph (Folkman, 1967) is a beautiful example of a graph with 3840 automorphisms, exemplified by the many symmetric ways to draw it.
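A permutation $\tau$ is a graph automorphism exactly when it preserves adjacency; in matrix form, $\mathbf{P}\mathbf{A}\mathbf{P}^\top = \mathbf{A}$ for the corresponding permutation matrix $\mathbf{P}$. A sketch on the 4-cycle:

```python
import numpy as np

# Adjacency matrix of the cycle graph C_4 (nodes 0-1-2-3-0).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

def perm_matrix(tau):
    # Matrix P with P[tau[i], i] = 1, so that P acts as i -> tau[i].
    n = len(tau)
    P = np.zeros((n, n))
    for i, j in enumerate(tau):
        P[j, i] = 1.0
    return P

def is_automorphism(tau, A):
    # tau is an automorphism iff it preserves adjacency: P A P^T == A.
    P = perm_matrix(tau)
    return np.array_equal(P @ A @ P.T, A)

assert is_automorphism((1, 2, 3, 0), A)       # rotating the cycle
assert not is_automorphism((1, 0, 2, 3), A)   # swapping 0 and 1 breaks edges
```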

3.3 Deformation Stability

The symmetry formalism introduced in Sections 3.1–3.2 captures an idealised world in which we know exactly which transformations are to be considered as symmetries, and we want to respect these symmetries exactly. For instance, in computer vision we might assume that planar translations are exact symmetries. However, the real world is noisy, and this model falls short in two ways.

Two objects moving at different velocities in a video define a transformation outside the translation group.

Firstly, while these simple groups provide a way to understand global symmetries of the domain $\Omega$ (and, by extension, of signals $\mathcal{X}(\Omega)$ on it), they do not capture local symmetries well. For instance, consider a video scene with several objects, each moving in its own direction. Consecutive frames of the resulting scene will contain approximately the same semantic information, yet no global translation explains the transformation from one frame to another. In other cases, such as a deformable 3D object viewed by a camera, it is simply very hard to describe the group of transformations that preserve the object identity. These examples illustrate that in reality we are interested in a far larger set of transformations, where global, exact invariance is replaced by a local, inexact one. In our discussion, we will distinguish between two scenarios: the setting where the domain $\Omega$ is fixed and the signals $x \in \mathcal{X}(\Omega)$ undergo deformations, and the setting where the domain $\Omega$ itself may be deformed.

Stability to signal deformations

In many applications, we know a priori that a small deformation of the signal $x$ should not change the output of $f(x)$, so it is tempting to consider such deformations as symmetries. For instance, we could view small diffeomorphisms $\tau\in\operatorname{Diff}(\Omega)$, or even small bijections, as symmetries. However, small deformations can be composed to form large deformations, so "small deformations" do not form a group (e.g., the composition of two $\epsilon$-isometries is a $2\epsilon$-isometry, violating the closure property), and we cannot ask for invariance or equivariance to small deformations only. Since large deformations can actually materially change the semantic content of the input, it is not a good idea to use the full group $\operatorname{Diff}(\Omega)$ as the symmetry group either.

A better approach is to quantify how “far” a given τDiff(Ω)𝜏DiffΩ\tau\in\operatorname{Diff}(\Omega)italic_τ ∈ roman_Diff ( roman_Ω ) is from a given symmetry subgroup 𝔊Diff(Ω)𝔊DiffΩ{\mathfrak{G}}\subset\operatorname{Diff}(\Omega)fraktur_G ⊂ roman_Diff ( roman_Ω ) (e.g. translations) with a complexity measure c(τ)𝑐𝜏c(\tau)italic_c ( italic_τ ), so that c(τ)=0𝑐𝜏0c(\tau)=0italic_c ( italic_τ ) = 0 whenever τ𝔊𝜏𝔊\tau\in{\mathfrak{G}}italic_τ ∈ fraktur_G. We can now replace our previous definition of exact invariance and equivarance under group actions with a ‘softer’ notion of deformation stability (or approximate invariance):

$\|f(\rho(\tau)x)-f(x)\| \leq C\,c(\tau)\,\|x\|, \quad \forall x\in\mathcal{X}(\Omega)$   (4)

where $\rho(\tau)x(u)=x(\tau^{-1}u)$ as before, and where $C$ is some constant independent of the signal $x$. A function $f\in\mathcal{F}(\mathcal{X}(\Omega))$ satisfying the above equation is said to be geometrically stable. We will see examples of such functions in Section 3.4.

Since $c(\tau)=0$ for $\tau\in\mathfrak{G}$, this definition generalises the $\mathfrak{G}$-invariance property defined above. Its utility in applications depends on introducing an appropriate deformation cost. In the case of images defined over a continuous Euclidean plane, a popular choice is $c^2(\tau):=\int_\Omega \|\nabla\tau(u)\|^2\,\mathrm{d}u$, which measures the 'elasticity' of $\tau$, i.e., how different it is from the displacement by a constant vector field. This deformation cost is in fact a norm, often called the Dirichlet energy, and can be used to quantify how far $\tau$ is from the translation group.
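To make the Dirichlet-energy cost concrete, here is a minimal NumPy sketch on a discretised displacement field (the grid size, spacing and example fields are illustrative, not from the text): a constant displacement, i.e. a pure translation, has zero cost, while a spatially varying ('elastic') one does not.

```python
import numpy as np

def dirichlet_energy(tau, h=1.0):
    """Discrete Dirichlet energy of a 2D displacement field.

    tau: array of shape (H, W, 2) giving the displacement vector at each
    grid point.  The energy sums the squared finite-difference gradients
    of both components; it vanishes iff the displacement is constant,
    i.e. a pure translation.
    """
    grads = np.gradient(tau, h, axis=(0, 1))   # derivatives along both grid axes
    return sum((g ** 2).sum() for g in grads)

# A constant displacement (global shift) has zero cost ...
shift = np.ones((8, 8, 2))
# ... while a spatially varying warp does not.
yy, xx = np.mgrid[0:8, 0:8].astype(float)
warp = np.stack([0.1 * np.sin(xx), np.zeros_like(xx)], axis=-1)

print(dirichlet_energy(shift))     # 0.0
print(dirichlet_energy(warp) > 0)  # True
```

In practice one would use a finer grid and the true pixel spacing `h`, but the qualitative behaviour (translations cost nothing, distortions cost something) is exactly the property used in the text.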

Refer to caption
Figure 6: The set of all bijective mappings from $\Omega$ into itself forms the set automorphism group $\mathrm{Aut}(\Omega)$, of which a symmetry group $\mathfrak{G}$ (shown as a circle) is a subgroup. Geometric Stability extends the notion of $\mathfrak{G}$-invariance and equivariance to 'transformations around $\mathfrak{G}$' (shown as a gray ring), quantified in the sense of some metric between transformations. In this example, a smooth distortion of the image is close to a shift.

Stability to domain deformations

In many applications, the object being deformed is not the signal but the geometric domain $\Omega$ itself. Canonical instances of this are applications dealing with graphs and manifolds: a graph can model a social network at different instances of time containing slightly different social relations (follow graph), or a manifold can model a 3D object undergoing non-rigid deformations. This deformation can be quantified as follows. If $\mathcal{D}$ denotes the space of all possible variable domains (such as the space of all graphs, or the space of Riemannian manifolds), one can define for $\Omega,\tilde{\Omega}\in\mathcal{D}$ an appropriate metric ('distance') $d(\Omega,\tilde{\Omega})$ satisfying $d(\Omega,\tilde{\Omega})=0$ if $\Omega$ and $\tilde{\Omega}$ are equivalent in some sense: for example, the graph edit distance vanishes when the graphs are isomorphic, and the Gromov-Hausdorff distance between Riemannian manifolds equipped with geodesic distances vanishes when two manifolds are isometric. (The graph edit distance measures the minimal cost of making two graphs isomorphic by a sequence of graph edit operations. The Gromov-Hausdorff distance measures the smallest possible metric distortion of a correspondence between two metric spaces, see Gromov (1981).)

A common construction of such distances between domains relies on some family of invertible mappings $\eta:\Omega\to\tilde{\Omega}$ that try to 'align' the domains so that the corresponding structures are best preserved. For example, in the case of graphs or Riemannian manifolds (regarded as metric spaces with the geodesic distance), this alignment can compare pair-wise adjacency or distance structures ($d$ and $\tilde{d}$, respectively),

$d_{\mathcal{D}}(\Omega,\tilde{\Omega})=\inf_{\eta\in\mathfrak{G}} \|d-\tilde{d}\circ(\eta\times\eta)\|$

where $\mathfrak{G}$ is the group of isomorphisms such as bijections or isometries, and the norm is defined over the product space $\Omega\times\Omega$. In other words, a distance between elements of $\Omega,\tilde{\Omega}$ is 'lifted' to a distance between the domains themselves, by accounting for all the possible alignments that preserve the internal structure. (Two graphs can be aligned by the Quadratic Assignment Problem (QAP), which in its simplest form considers two graphs $G,\tilde{G}$ of the same size $n$, and solves $\min_{\mathbf{P}\in\Sigma_n}\mathrm{trace}(\mathbf{A}\mathbf{P}\tilde{\mathbf{A}}\mathbf{P}^\top)$, where $\mathbf{A},\tilde{\mathbf{A}}$ are the respective adjacency matrices and $\Sigma_n$ is the group of $n\times n$ permutation matrices. The graph edit distance can be associated with such a QAP (Bougleux et al., 2015).) Given a signal $x\in\mathcal{X}(\Omega)$ and a deformed domain $\tilde{\Omega}$, one can then consider the deformed signal $\tilde{x}=x\circ\eta^{-1}\in\mathcal{X}(\tilde{\Omega})$.
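The QAP-based graph alignment can be illustrated by brute force for tiny graphs. The sketch below (all names are illustrative) searches all permutation matrices for the one maximising the edge-overlap objective $\mathrm{trace}(\mathbf{A}\mathbf{P}\tilde{\mathbf{A}}\mathbf{P}^\top)$; this is only feasible for very small $n$, since the QAP is NP-hard and real solvers use relaxations.

```python
import itertools
import numpy as np

def qap_align(A, A_tilde):
    """Brute-force Quadratic Assignment between two small graphs.

    Searches all n! permutation matrices P and returns the score and
    permutation maximising trace(A P A~ P^T), i.e. the alignment
    matching as many edges as possible.
    """
    n = A.shape[0]
    best_score, best_P = -np.inf, None
    for perm in itertools.permutations(range(n)):
        P = np.eye(n)[list(perm)]
        score = np.trace(A @ P @ A_tilde @ P.T)
        if score > best_score:
            best_score, best_P = score, P
    return best_score, best_P

# A 4-cycle and a relabelled copy of it align perfectly.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
Q = np.eye(4)[[2, 0, 3, 1]]          # an arbitrary relabelling
A_tilde = Q @ A @ Q.T

score, P = qap_align(A, A_tilde)
print(score)  # 8.0: twice the number of edges, i.e. a perfect alignment
```

When the two graphs are isomorphic, the optimal score equals $\sum_{ij}A_{ij}^2$ (here $8$, for $4$ edges), consistent with a vanishing edit distance.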

By slightly abusing the notation, we define $\mathcal{X}(\mathcal{D})=\{(\mathcal{X}(\Omega),\Omega)\,:\,\Omega\in\mathcal{D}\}$ as the ensemble of possible input signals defined over a varying domain. A function $f:\mathcal{X}(\mathcal{D})\to\mathcal{Y}$ is stable to domain deformations if

$\|f(x,\Omega)-f(\tilde{x},\tilde{\Omega})\| \leq C\,\|x\|\,d_{\mathcal{D}}(\Omega,\tilde{\Omega})$   (5)

for all $\Omega,\tilde{\Omega}\in\mathcal{D}$ and $x\in\mathcal{X}(\Omega)$. We will discuss this notion of stability in the context of manifolds in Sections 4.4–4.6, where isometric deformations play a crucial role. Furthermore, it can be shown that stability to domain deformations is a natural generalisation of stability to signal deformations, by viewing the latter in terms of deformations of the volume form (Gama et al., 2019).

3.4 Scale Separation

While deformation stability substantially strengthens the global symmetry priors, it is not sufficient in itself to overcome the curse of dimensionality, in the sense that, informally speaking, there are still "too many" functions that respect (4) as the size of the domain grows. A key insight to overcome this curse is to exploit the multiscale structure of physical tasks. Before describing multiscale representations, we need to introduce the main elements of Fourier transforms, which rely on frequency rather than scale.

Fourier Transform and Global invariants

Arguably the most famous signal decomposition is the Fourier transform, the cornerstone of harmonic analysis. (Margin note: Fourier basis functions have global support. As a result, local signals produce energy across all frequencies.) The classical one-dimensional Fourier transform

$\hat{x}(\xi)=\int_{-\infty}^{+\infty} x(u)\,e^{-\mathrm{i}\xi u}\,\mathrm{d}u$

expresses the function $x(u)\in L^2(\Omega)$ on the domain $\Omega=\mathbb{R}$ as a linear combination of orthogonal oscillating basis functions $\varphi_\xi(u)=e^{\mathrm{i}\xi u}$, indexed by their rate of oscillation (or frequency) $\xi$. Such an organisation into frequencies reveals important information about the signal, e.g. its smoothness and localisation. The Fourier basis itself has a deep geometric foundation and can be interpreted as the natural vibrations of the domain, related to its geometric structure (see e.g. Berger (2012)).
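The link between frequency content and smoothness or localisation can be seen numerically with the discrete Fourier transform; in this illustrative NumPy sketch, a smooth signal concentrates its energy in low frequencies, while a sharply localised spike spreads energy flatly across the whole spectrum.

```python
import numpy as np

n = 256
u = np.arange(n) / n
smooth = np.sin(2 * np.pi * u)                    # one slow oscillation
localised = np.zeros(n); localised[n // 2] = 1.0  # a single spike

def high_freq_energy_fraction(x):
    """Fraction of spectral energy in the upper half of frequencies."""
    xh = np.abs(np.fft.fft(x)) ** 2
    return xh[n // 4 : 3 * n // 4].sum() / xh.sum()

print(high_freq_energy_fraction(smooth) < 0.01)     # True: low-pass
print(high_freq_energy_fraction(localised) > 0.45)  # True: flat spectrum
```

The spike illustrates the margin note above: because Fourier atoms have global support, a perfectly localised signal has energy at every frequency.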

The Fourier transform plays a crucial role in signal processing, as it offers a dual formulation of convolution (in the following, we use convolution and (cross-)correlation $(x\star\theta)(u)=\int_{-\infty}^{+\infty} x(v)\theta(u+v)\,\mathrm{d}v$ interchangeably, as is common in machine learning: the two differ only in whether the filter is reflected, and since the filter is typically learnable, the distinction is purely notational),

$(x\star\theta)(u)=\int_{-\infty}^{+\infty} x(v)\,\theta(u-v)\,\mathrm{d}v$

a standard model of linear signal filtering (here and in the following, $x$ denotes the signal and $\theta$ the filter). As we will show in the following, the convolution operator is diagonalised in the Fourier basis, making it possible to express convolution as the product of the respective Fourier transforms,

$\widehat{(x\star\theta)}(\xi)=\hat{x}(\xi)\cdot\hat{\theta}(\xi),$

a fact known in signal processing as the Convolution Theorem.
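The Convolution Theorem holds exactly in the discrete, periodic setting, which makes it easy to verify numerically: circular convolution in the spatial domain equals pointwise multiplication of DFTs (a NumPy sketch with illustrative random signals).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)      # signal
theta = rng.standard_normal(16)  # filter

# Frequency domain: multiply the DFTs, then transform back.
conv = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(theta)))

# Spatial domain: direct circular convolution for comparison.
direct = np.array([sum(x[v] * theta[(u - v) % 16] for v in range(16))
                   for u in range(16)])

print(np.allclose(conv, direct))  # True
```

The same identity underlies fast convolution via the FFT, and, as discussed next, its generalisation to Laplacian eigenbases is what extends convolution beyond Euclidean domains.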

As it turns out, many fundamental differential operators such as the Laplacian are described as convolutions on Euclidean domains. Since such differential operators can be defined intrinsically over very general geometries, this provides a formal procedure to extend Fourier transforms beyond Euclidean domains, including graphs, groups and manifolds. We will discuss this in detail in Section 4.4.

An essential aspect of Fourier transforms is that they reveal global properties of the signal and the domain, such as smoothness or conductance. Such global behaviour is convenient in the presence of global symmetries of the domain such as translation, but not for studying more general diffeomorphisms. This requires a representation that trades off spatial and frequential localisation, as we see next.

Multiscale representations

The notion of local invariance can be articulated by switching from a Fourier frequency-based representation to a scale-based representation, the cornerstone of multi-scale decomposition methods such as wavelets (see Mallat (1999) for a comprehensive introduction). The essential insight of multi-scale methods is to decompose functions defined over the domain $\Omega$ into elementary functions that are localised both in space and frequency. (Margin note: Contrary to Fourier, wavelet atoms are localised and multi-scale, allowing fine details of the signal to be captured with atoms having small spatial support and coarse details with atoms having large spatial support. The term atom here is synonymous with 'basis element' in Fourier analysis, with the caveat that wavelets are redundant, i.e. over-complete.) In the case of wavelets, this is achieved by correlating a translated and dilated filter (mother wavelet) $\psi$, producing a combined spatio-frequency representation called the continuous wavelet transform

$(W_\psi x)(u,\xi)=\xi^{-1/2}\int_{-\infty}^{+\infty}\psi\!\left(\frac{v-u}{\xi}\right)x(v)\,\mathrm{d}v.$

The translated and dilated filters are called wavelet atoms; their spatial position and dilation correspond to the coordinates $u$ and $\xi$ of the wavelet transform. These coordinates are usually sampled dyadically ($\xi=2^{-j}$ and $u=2^{-j}k$), with $j$ referred to as scale. Multi-scale signal representations bring important benefits in terms of capturing regularity properties beyond global smoothness, such as piece-wise smoothness, which made them a popular tool in signal and image processing and numerical analysis in the 90s.
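A dyadic decomposition is easiest to see with the simplest wavelet, the Haar wavelet (chosen here for illustration; the text's continuous transform admits many mother wavelets $\psi$): each level splits the signal into a coarse average and detail coefficients at scale $j$.

```python
import numpy as np

def haar_step(x):
    """One level of the orthonormal Haar transform: pairwise averages
    (coarse part) and pairwise differences (detail coefficients)."""
    x = x.reshape(-1, 2)
    coarse = (x[:, 0] + x[:, 1]) / np.sqrt(2)
    detail = (x[:, 0] - x[:, 1]) / np.sqrt(2)
    return coarse, detail

def haar_transform(x):
    """Full dyadic decomposition: details at scales j = 1, 2, ...
    plus the final coarse average.  len(x) must be a power of two."""
    details = []
    while len(x) > 1:
        x, d = haar_step(x)
        details.append(d)
    return x, details

signal = np.array([4., 6., 10., 12., 8., 6., 5., 5.])
coarse, details = haar_transform(signal)

# Orthonormality: the total energy is preserved across scales.
energy = coarse[0] ** 2 + sum((d ** 2).sum() for d in details)
print(np.isclose(energy, (signal ** 2).sum()))  # True
```

For piece-wise smooth signals most detail coefficients are near zero, which is exactly the sparsity property that made wavelets popular in signal and image processing.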

Deformation stability of Multiscale representations:

The benefit of multiscale localised wavelet decompositions over Fourier decompositions is revealed when considering the effect of small deformations ‘nearby’ the underlying symmetry group. Let us illustrate this important concept in the Euclidean domain and the translation group. Since the Fourier representation diagonalises the shift operator (which can be thought of as convolution, as we will see in more detail in Section 4.2), it is an efficient representation for translation transformations. However, Fourier decompositions are unstable under high-frequency deformations. In contrast, wavelet decompositions offer a stable representation in such cases.

Indeed, let us consider $\tau\in\mathrm{Aut}(\Omega)$ and its associated linear representation $\rho(\tau)$. When $\tau(u)=u-v$ is a shift, as we will verify in Section 4.2, the operator $\rho(\tau)=S_v$ is a shift operator that commutes with convolution. Since convolution operators are diagonalised by the Fourier transform, the action of a shift in the frequency domain amounts to shifting the complex phase of the Fourier transform,

$(\widehat{S_v x})(\xi)=e^{-\mathrm{i}\xi v}\,\hat{x}(\xi).$

Thus, the Fourier modulus $f(x)=|\hat{x}|$, which removes the complex phase, is a simple shift-invariant function: $f(S_v x)=f(x)$. However, if we have only an approximate translation, $\tau(u)=u-\tilde{\tau}(u)$ with $\|\nabla\tau\|_\infty=\sup_{u\in\Omega}\|\nabla\tilde{\tau}(u)\|\leq\epsilon$, the situation is entirely different: it is possible to show that

$\dfrac{\|f(\rho(\tau)x)-f(x)\|}{\|x\|}=\mathcal{O}(1)$

irrespective of how small $\epsilon$ is (i.e., how close $\tau$ is to being a shift). Consequently, such a Fourier representation is unstable under deformations, however small. This instability manifests itself on general domains and under non-rigid transformations; we will see another instance of it in the analysis of 3D shapes using the natural extension of Fourier transforms described in Section 4.4.
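This $\mathcal{O}(1)$ instability is easy to reproduce numerically for a small dilation $\tau(u)=(1+\epsilon)u$ acting on a high-frequency sinusoid (all parameter values below are illustrative): the deformation moves the frequency $\xi_0$ to $(1+\epsilon)\xi_0$, a displacement $\epsilon\xi_0$ that grows with $\xi_0$, so the two modulus spectra end up with disjoint support however small $\epsilon$ is.

```python
import numpy as np

n, eps, xi0 = 1024, 0.02, 200
u = np.arange(n) / n
x = np.cos(2 * np.pi * xi0 * u)                  # high-frequency signal
x_def = np.cos(2 * np.pi * xi0 * (1 + eps) * u)  # slightly dilated copy

# Relative error of the Fourier modulus under the deformation.
xh = np.abs(np.fft.fft(x))
xh_def = np.abs(np.fft.fft(x_def))
rel_err = np.linalg.norm(xh - xh_def) / np.linalg.norm(xh)
print(rel_err)  # ~1.414 = sqrt(2): the spectral peaks do not overlap at all
```

Halving `eps` only moves the deformed peak to a nearer bin; as long as the peaks are disjoint, the relative error stays at $\sqrt{2}$, an $\mathcal{O}(1)$ error exactly as the bound above states.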

Wavelets offer a remedy to this problem that also reveals the power of multi-scale representations. In the above example, one can show (Mallat, 2012) that the wavelet decomposition $W_\psi x$ is approximately equivariant to deformations,

$\dfrac{\|\rho(\tau)(W_\psi x)-W_\psi(\rho(\tau)x)\|}{\|x\|}=\mathcal{O}(\epsilon).$

In other words, decomposing the signal information into scales using localised filters rather than frequencies turns a global unstable representation into a family of locally stable features. Importantly, such measurements at different scales are not yet invariant, and need to be progressively processed towards the low frequencies, hinting at the deep compositional nature of modern neural networks, and captured in our Blueprint for Geometric Deep Learning, presented next.

Scale Separation Prior:

We can build on this insight by considering a multiscale coarsening of the data domain $\Omega$ into a hierarchy $\Omega_1,\ldots,\Omega_J$. As it turns out, such coarsening can be defined on very general domains, including grids, graphs, and manifolds. Informally, a coarsening assimilates nearby points $u,u'\in\Omega$ together, and thus only requires an appropriate notion of metric in the domain. If $\mathcal{X}_j(\Omega_j,\mathcal{C}_j):=\{x_j:\Omega_j\to\mathcal{C}_j\}$ denotes signals defined over the coarsened domain $\Omega_j$, we informally say that a function $f:\mathcal{X}(\Omega)\to\mathcal{Y}$ is locally stable at scale $j$ if it admits a factorisation of the form $f\approx f_j\circ P_j$, where $P_j:\mathcal{X}(\Omega)\to\mathcal{X}_j(\Omega_j)$ is a non-linear coarse graining and $f_j:\mathcal{X}_j(\Omega_j)\to\mathcal{Y}$. In other words, while the target function $f$ might depend on complex long-range interactions between features over the whole domain, for locally-stable functions it is possible to separate the interactions across scales, by first focusing on localised interactions that are then propagated towards the coarse scales.
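A toy instance of the factorisation $f\approx f_j\circ P_j$ (the pooling choice and target function below are illustrative): if the target only asks whether the signal exceeds a threshold somewhere, it can be computed just as well after a non-linear coarse graining $P_j$ such as max-pooling.

```python
import numpy as np

def coarsen(x, factor=2):
    """Non-linear coarse graining P_j: max-pool the signal over
    non-overlapping windows, assimilating nearby points of the domain."""
    return x.reshape(-1, factor).max(axis=1)

# A locally-stable target: 'does the signal exceed 1.0 anywhere?'
# It factorises as f = f_j . P_j: evaluating it on the coarsened
# domain gives the same answer as on the full one.
f = lambda x: float(x.max() > 1.0)

x = np.array([0.1, 0.3, 0.2, 1.5, 0.4, 0.0, 0.9, 0.2])
print(f(x) == f(coarsen(x)))  # True
```

Targets dominated by long-range interactions (e.g. a function of $x[0]\cdot x[-1]$) would not factor through such a local coarsening in one step, which is why the blueprint composes several coarsening stages.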

Refer to caption
Figure 7: Illustration of Scale Separation for image classification tasks. The classifier $f'$ defined on signals on the coarse grid $\mathcal{X}(\Omega')$ should satisfy $f\approx f'\circ P$, where $P:\mathcal{X}(\Omega)\rightarrow\mathcal{X}(\Omega')$.

Such principles are of fundamental importance in many areas of physics and mathematics, as manifested for instance in statistical physics in the so-called renormalisation group, or leveraged in important numerical algorithms such as the Fast Multipole Method (a numerical technique originally developed to speed up the calculation of long-range forces in $n$-body problems, which groups sources that lie close together and treats them as a single source). In machine learning, multiscale representations and local invariance are the fundamental mathematical principles underpinning the efficiency of Convolutional Neural Networks and Graph Neural Networks, and are typically implemented in the form of local pooling. In future work, we will further develop tools from computational harmonic analysis that unify these principles across our geometric domains and shed light on the statistical learning benefits of scale separation.

3.5 The Blueprint of Geometric Deep Learning

The geometric principles of Symmetry, Geometric Stability, and Scale Separation discussed in Sections 3.1–3.4 can be combined to provide a universal blueprint for learning stable representations of high-dimensional data. These representations will be produced by functions $f$ operating on signals $\mathcal{X}(\Omega,\mathcal{C})$ defined on the domain $\Omega$, which is endowed with a symmetry group $\mathfrak{G}$.

The geometric priors we have described so far do not prescribe a specific architecture for building such a representation, but rather a series of necessary conditions. However, they hint at an axiomatic construction that provably satisfies these geometric priors, while ensuring a highly expressive representation that can approximate any target function satisfying such priors.

A simple initial observation is that, in order to obtain a highly expressive representation, we are required to introduce a non-linear element, since if $f$ is linear and $\mathfrak{G}$-invariant, then for all $x\in\mathcal{X}(\Omega)$ (here, $\mu(\mathfrak{g})$ is the Haar measure of the group $\mathfrak{G}$, and the integral is performed over the entire group),

$f(x)=\dfrac{1}{\mu(\mathfrak{G})}\displaystyle\int_{\mathfrak{G}} f(\mathfrak{g}.x)\,\mathrm{d}\mu(\mathfrak{g})=f\!\left(\dfrac{1}{\mu(\mathfrak{G})}\displaystyle\int_{\mathfrak{G}}(\mathfrak{g}.x)\,\mathrm{d}\mu(\mathfrak{g})\right),$

which indicates that $f$ only depends on $x$ through the $\mathfrak{G}$-average $Ax=\frac{1}{\mu(\mathfrak{G})}\int_{\mathfrak{G}}(\mathfrak{g}.x)\,\mathrm{d}\mu(\mathfrak{g})$. In the case of images and translation, this would entail using only the average RGB color of the input!
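For a finite group this collapse is easy to verify; the NumPy sketch below uses the cyclic translation group on a length-6 signal (sizes and values are illustrative). The $\mathfrak{G}$-average replaces every entry by the mean, so a linear shift-invariant functional $f(x)=\langle w,x\rangle$ can only see the average value, the 1D analogue of the average RGB color.

```python
import numpy as np

def group_average(x):
    """G-average over the cyclic translation group: the mean of all
    circular shifts of the signal."""
    n = len(x)
    return np.mean([np.roll(x, s) for s in range(n)], axis=0)

x = np.array([3., 1., 4., 1., 5., 9.])
Ax = group_average(x)
print(np.allclose(Ax, x.mean()))  # True: every entry becomes the mean

# A linear functional is shift-invariant iff its weights are constant,
# so it only depends on x through Ax, i.e. through the average.
w = np.full(6, 0.5)
print(np.isclose(w @ x, w @ Ax))  # True
```

This is exactly why the blueprint interposes non-linearities before invariant aggregation: without them, invariance forces the representation through this very lossy average.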

While this reasoning shows that the family of linear invariants is not a very rich object, the family of linear equivariants provides a much more powerful tool, since it enables the construction of rich and stable features by composition with appropriate non-linear maps, as we will now explain. Indeed, if $B:\mathcal{X}(\Omega,\mathcal{C})\to\mathcal{X}(\Omega,\mathcal{C}')$ is $\mathfrak{G}$-equivariant, satisfying $B(\mathfrak{g}.x)=\mathfrak{g}.B(x)$ for all $x\in\mathcal{X}$ and $\mathfrak{g}\in\mathfrak{G}$, and $\sigma:\mathcal{C}'\to\mathcal{C}''$ is an arbitrary (non-linear) map, then we easily verify that the composition $U:=(\bm{\sigma}\circ B):\mathcal{X}(\Omega,\mathcal{C})\to\mathcal{X}(\Omega,\mathcal{C}'')$ is also $\mathfrak{G}$-equivariant, where $\bm{\sigma}:\mathcal{X}(\Omega,\mathcal{C}')\to\mathcal{X}(\Omega,\mathcal{C}'')$ is the element-wise instantiation of $\sigma$ given as $(\bm{\sigma}(x))(u):=\sigma(x(u))$.
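This composition property can be checked numerically for the translation group; in this illustrative sketch, $B$ is a circular convolution (a linear shift-equivariant map) and $\sigma$ a pointwise ReLU, and $U=\bm{\sigma}\circ B$ still commutes with shifts.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(12)
theta = rng.standard_normal(12)  # an arbitrary filter

# B: circular convolution with theta, computed via the FFT (linear,
# shift-equivariant).  sigma: ReLU, applied element-wise.
B = lambda x: np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(theta)))
U = lambda x: np.maximum(B(x), 0.0)

shift = lambda x, s=3: np.roll(x, s)  # the group action g.x

# Equivariance of the composition: U(g.x) = g.U(x).
print(np.allclose(U(shift(x)), shift(U(x))))  # True
```

The same check fails for a non-pointwise $\sigma$ (say, one that mixes values at different positions non-uniformly), which is why the blueprint restricts nonlinearities to act on the channel dimension $\mathcal{C}$ only.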

This simple property allows us to define a very general family of $\mathfrak{G}$-invariants, by composing $U$ with the group average, $A\circ U:\mathcal{X}(\Omega,\mathcal{C})\to\mathcal{C}''$. A natural question is thus whether any $\mathfrak{G}$-invariant function can be approximated at arbitrary precision by such a model, for appropriate choices of $B$ and $\sigma$. It is not hard to adapt the standard Universal Approximation Theorems from unstructured vector inputs to show that shallow 'geometric' networks are also universal approximators, by properly generalising the group average to a general non-linear invariant (such proofs have been demonstrated, for example, for the Deep Sets model by Zaheer et al. (2017)). However, as already described in the case of Fourier versus wavelet invariants, there is a fundamental tension between shallow global invariance and deformation stability. This motivates an alternative representation, which considers instead localised equivariant maps. (Meaningful metrics can be defined on grids, graphs, manifolds, and groups; a notable exception are sets, where there is no predefined notion of metric.) Assuming that $\Omega$ is further equipped with a distance metric $d$, we call an equivariant map $U$ localised if $(Ux)(u)$ depends only on the values of $x(v)$ for $v\in\mathcal{N}_u=\{v : d(u,v)\leq r\}$, for some small radius $r$; the set $\mathcal{N}_u$ is called the receptive field.

A single layer of a local equivariant map $U$ cannot approximate functions with long-range interactions, but a composition of several local equivariant maps $U_J \circ U_{J-1} \circ \dots \circ U_1$ increases the receptive field (the term 'receptive field' originated in the neuroscience literature, referring to the spatial domain that affects the output of a given neuron) while preserving the stability properties of local equivariants. The receptive field is further increased by interleaving downsampling operators that coarsen the domain (again assuming a metric structure), completing the parallel with Multiresolution Analysis (MRA, see e.g. Mallat (1999)).

Figure 8: Geometric Deep Learning blueprint, exemplified on a graph. A typical Graph Neural Network architecture may contain permutation equivariant layers (computing node-wise features), local pooling (graph coarsening), and a permutation-invariant global pooling layer (readout layer).

In summary, the geometry of the input domain, with knowledge of an underlying symmetry group, provides three key building blocks: (i) a local equivariant map, (ii) a global invariant map, and (iii) a coarsening operator. Combining these building blocks yields a rich function approximation space with prescribed invariance and stability properties, in a scheme we refer to as the Geometric Deep Learning Blueprint (Figure 8).

Geometric Deep Learning Blueprint

Let $\Omega$ and $\Omega'$ be domains, $\mathfrak{G}$ a symmetry group over $\Omega$, and write $\Omega' \subseteq \Omega$ if $\Omega'$ can be considered a compact version of $\Omega$. We define the following building blocks:

Linear $\mathfrak{G}$-equivariant layer $B : \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{X}(\Omega',\mathcal{C}')$ satisfying $B(\mathfrak{g}.x) = \mathfrak{g}.B(x)$ for all $\mathfrak{g} \in \mathfrak{G}$ and $x \in \mathcal{X}(\Omega,\mathcal{C})$.

Nonlinearity $\sigma : \mathcal{C} \to \mathcal{C}'$ applied element-wise as $(\bm{\sigma}(x))(u) = \sigma(x(u))$.

Local pooling (coarsening) $P : \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{X}(\Omega',\mathcal{C})$, such that $\Omega' \subseteq \Omega$.

$\mathfrak{G}$-invariant layer (global pooling) $A : \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{Y}$ satisfying $A(\mathfrak{g}.x) = A(x)$ for all $\mathfrak{g} \in \mathfrak{G}$ and $x \in \mathcal{X}(\Omega,\mathcal{C})$.

Using these blocks allows constructing $\mathfrak{G}$-invariant functions $f : \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{Y}$ of the form

$f = A \circ \sigma_J \circ B_J \circ P_{J-1} \circ \dots \circ P_1 \circ \sigma_1 \circ B_1$

where the blocks are selected such that the output space of each block matches the input space of the next one. Different blocks may exploit different choices of symmetry groups $\mathfrak{G}$.
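The blueprint composition can be illustrated with a minimal sketch on a one-dimensional periodic grid, where $\mathfrak{G}$ is the group of cyclic shifts: a linear equivariant layer $B$ (a cyclic convolution), an element-wise nonlinearity $\sigma$, and an invariant global pooling $A$ (the average). The filter and input here are arbitrary illustrative choices, not part of the text.

```python
import numpy as np

# Sketch of f = A ∘ σ ∘ B on a 1D periodic grid with the cyclic shift group.
def B(x, theta):
    """Linear shift-equivariant layer: cyclic convolution with filter theta."""
    n = len(x)
    return np.array([sum(theta[j] * x[(u - j) % n] for j in range(len(theta)))
                     for u in range(n)])

def sigma(x):
    """Element-wise nonlinearity (ReLU)."""
    return np.maximum(x, 0.0)

def A(x):
    """Shift-invariant global pooling: the average over the domain."""
    return x.mean()

theta = np.array([0.2, 0.5, 0.3])
f = lambda x: A(sigma(B(x, theta)))

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0])
# Shifting the input leaves the output unchanged: f is G-invariant.
assert np.isclose(f(x), f(np.roll(x, 2)))
```

Swapping the average $A$ for the identity would instead yield a shift-equivariant map, the building block used inside the network rather than at its output.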

Different settings of Geometric Deep Learning

One can make an important distinction between the setting where the domain $\Omega$ is assumed to be fixed and one is only interested in varying input signals defined on that domain, and the setting where the domain is part of the input and varies together with the signals defined on it. A classical instance of the former case is encountered in computer vision applications, where images are assumed to be defined on a fixed domain (grid). Graph classification is an example of the latter setting, where both the structure of the graph and the signal defined on it (e.g. node features) are important. In the case of a varying domain, geometric stability (in the sense of insensitivity to deformations of $\Omega$) plays a crucial role in Geometric Deep Learning architectures.

This blueprint has the right level of generality to be used across a wide range of geometric domains. Different Geometric Deep Learning methods thus differ in their choice of the domain, symmetry group, and the specific implementation details of the aforementioned building blocks. As we will see in the following, a large class of deep learning architectures currently in use fall into this scheme and can thus be derived from common geometric principles.

In the following sections (4.1–4.6) we will describe the various geometric domains, focusing on the '5G', and in Sections 5.1–5.8 the specific implementations of Geometric Deep Learning on these domains.

Architecture          Domain $\Omega$            Symmetry group $\mathfrak{G}$
CNN                   Grid                       Translation
Spherical CNN         Sphere / $\mathrm{SO}(3)$  Rotation $\mathrm{SO}(3)$
Intrinsic / Mesh CNN  Manifold                   Isometry $\mathrm{Iso}(\Omega)$ / Gauge symmetry $\mathrm{SO}(2)$
GNN                   Graph                      Permutation $\Sigma_n$
Deep Sets             Set                        Permutation $\Sigma_n$
Transformer           Complete Graph             Permutation $\Sigma_n$
LSTM                  1D Grid                    Time warping

4 Geometric Domains: the 5 Gs

The main focus of our text will be on graphs, grids, groups, geodesics, and gauges. In this context, by ‘groups’ we mean global symmetry transformations in homogeneous space, by ‘geodesics’ metric structures on manifolds, and by ‘gauges’ local reference frames defined on tangent bundles (and vector bundles in general). These notions will be explained in more detail later. In the next sections, we will discuss in detail the main elements in common and the key distinguishing features between these structures and describe the symmetry groups associated with them. Our exposition is not in the order of generality – in fact, grids are particular cases of graphs – but a way to highlight important concepts underlying our Geometric Deep Learning blueprint.

Figure 9: The 5G of Geometric Deep Learning: grids, groups & homogeneous spaces with global symmetry, graphs, geodesics & metrics on manifolds, and gauges (frames for tangent or feature spaces).

4.1 Graphs and Sets

In multiple branches of science, from sociology to particle physics, graphs are used as models of systems of relations and interactions. From our perspective, graphs give rise to a very basic type of invariance modelled by the group of permutations. Furthermore, other objects of interest to us, such as grids and sets, can be obtained as a particular case of graphs.

A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is a collection of nodes $\mathcal{V}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ between pairs of nodes. (Depending on the application field, nodes may also be called vertices, and edges are often referred to as links or relations; we will use these terms interchangeably.) For the purpose of the following discussion, we will further assume the nodes to be endowed with $s$-dimensional node features, denoted by $\mathbf{x}_u$ for all $u \in \mathcal{V}$. Social networks are perhaps among the most commonly studied examples of graphs, where nodes represent users, edges correspond to friendship relations between them, and node features model user properties such as age, profile picture, etc. It is also often possible to endow the edges, or entire graphs, with features; but as this does not alter the main findings of this section, we will defer discussing it to future work. (An isomorphism is an edge-preserving bijection between two graphs; two isomorphic graphs are identical up to a reordering of their nodes.)

The key structural property of graphs is that the nodes in $\mathcal{V}$ are usually not assumed to be provided in any particular order, and thus any operations performed on graphs should not depend on the ordering of the nodes. The desirable property that functions acting on graphs should satisfy is thus permutation invariance, which implies that for any two isomorphic graphs, the outcomes of these functions are identical. We can see this as a particular setting of our blueprint, where the domain is $\Omega = \mathcal{G}$ and the space $\mathcal{X}(\mathcal{G}, \mathbb{R}^d)$ is that of $d$-dimensional node-wise signals. The symmetry we consider is given by the permutation group $\mathfrak{G} = \Sigma_n$, whose elements are all the possible orderings of the set of node indices $\{1, \ldots, n\}$.

Let us first illustrate the concept of permutation invariance on sets, a special case of graphs without edges (i.e., $\mathcal{E} = \emptyset$). By stacking the node features as rows of the $n \times d$ matrix $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^\top$, we effectively specify an ordering of the nodes. The action of a permutation $\mathfrak{g} \in \Sigma_n$ on the set of nodes amounts to a reordering of the rows of $\mathbf{X}$, which can be represented as an $n \times n$ permutation matrix $\rho(\mathfrak{g}) = \mathbf{P}$, where each row and column contains exactly one $1$ and all other entries are zero. (There are exactly $n!$ such permutations, so $\Sigma_n$ is, even for modest $n$, a very large group.)

A function $f$ operating on this set is then said to be permutation invariant if, for any such permutation matrix $\mathbf{P}$, it holds that $f(\mathbf{P}\mathbf{X}) = f(\mathbf{X})$. One simple such function is

$f(\mathbf{X}) = \phi\left(\sum_{u \in \mathcal{V}} \psi(\mathbf{x}_u)\right)$, (6)

where the function $\psi$ is independently applied to every node's features, and $\phi$ is applied to their sum-aggregated outputs: since the sum is independent of the order in which its inputs are provided, such a function is invariant to permutations of the node set, and is hence guaranteed to always return the same output no matter how the nodes are permuted.
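The invariance of equation (6) is easy to verify numerically. The following is a minimal sketch in which $\psi$ and $\phi$ are arbitrary illustrative choices (not the authors' learned models):

```python
import numpy as np

# Equation (6): psi applied node-wise, sum-aggregated, then phi applied.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # n = 5 nodes with d = 3 features each

psi = lambda x: np.tanh(x)         # applied independently to each node
phi = lambda z: z ** 2             # applied to the sum-aggregated output

def f(X):
    return phi(sum(psi(x_u) for x_u in X))

# Permutation invariance: reordering the rows of X leaves f(X) unchanged.
P = np.eye(5)[rng.permutation(5)]  # a random permutation matrix
assert np.allclose(f(P @ X), f(X))
```

Any permutation-invariant aggregator (max, mean) could replace the sum without breaking the property; the sum is simply the choice used in equation (6).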

Functions like the above provide a 'global' graph-wise output, but very often we will be interested in functions that act 'locally', in a node-wise manner. For example, we may want to apply some function to update the features in every node, obtaining a set of latent node features. If we stack these latent features into a matrix $\mathbf{H} = \mathbf{F}(\mathbf{X})$ (we use the bold notation $\mathbf{F}(\mathbf{X})$ to emphasise that it outputs node-wise vector features and is hence a matrix-valued function), the result is no longer permutation invariant: the order of the rows of $\mathbf{H}$ should be tied to the order of the rows of $\mathbf{X}$, so that we know which output node feature corresponds to which input node. We instead need a more fine-grained notion of permutation equivariance, stating that, once we "commit" to a permutation of the inputs, it consistently permutes the resulting objects. Formally, $\mathbf{F}(\mathbf{X})$ is a permutation equivariant function if, for any permutation matrix $\mathbf{P}$, it holds that $\mathbf{F}(\mathbf{P}\mathbf{X}) = \mathbf{P}\mathbf{F}(\mathbf{X})$. A shared node-wise linear transform

$\mathbf{F}_{\mathbf{\Theta}}(\mathbf{X}) = \mathbf{X}\mathbf{\Theta}$ (7)

specified by a weight matrix $\mathbf{\Theta} \in \mathbb{R}^{d \times d'}$, is one possible construction of such a permutation equivariant function, producing in our example latent features of the form $\mathbf{h}_u = \mathbf{\Theta}^\top \mathbf{x}_u$.
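Equivariance of equation (7) follows directly from associativity of matrix multiplication, $\mathbf{P}(\mathbf{X}\mathbf{\Theta}) = (\mathbf{P}\mathbf{X})\mathbf{\Theta}$, which the following sketch (with randomly chosen weights, purely for illustration) confirms:

```python
import numpy as np

# The shared node-wise linear transform of equation (7) is permutation
# equivariant: F(PX) = (PX)Θ = P(XΘ) = P F(X).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))             # n = 4 nodes, d = 3 features
Theta = rng.normal(size=(3, 2))         # weight matrix, d' = 2

F = lambda X: X @ Theta

P = np.eye(4)[[2, 0, 3, 1]]             # an arbitrary permutation matrix
assert np.allclose(F(P @ X), P @ F(X))  # equivariance holds exactly
```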

This construction arises naturally from our Geometric Deep Learning blueprint. We can first attempt to characterise linear equivariant maps (functions satisfying $\mathbf{F}(\mathbf{P}\mathbf{X}) = \mathbf{P}\mathbf{F}(\mathbf{X})$), for which it is easy to verify that any such map can be written as a linear combination of two generators: the identity $\mathbf{F}_1(\mathbf{X}) = \mathbf{X}$ and the average $\mathbf{F}_2(\mathbf{X}) = \frac{1}{n}\mathbf{1}\mathbf{1}^\top\mathbf{X}$, each of whose rows equals $\frac{1}{n}\sum_{u=1}^n \mathbf{x}_u$. As will be described in Section 5.4, the popular Deep Sets (Zaheer et al., 2017) architecture follows precisely this blueprint.

We can now generalise the notions of permutation invariance and equivariance from sets to graphs. In the generic setting $\mathcal{E} \neq \emptyset$, the graph connectivity can be represented by the $n \times n$ adjacency matrix $\mathbf{A}$ (when the graph is undirected, i.e. $(u,v) \in \mathcal{E}$ iff $(v,u) \in \mathcal{E}$, the adjacency matrix is symmetric, $\mathbf{A} = \mathbf{A}^\top$), defined as

$a_{uv} = \begin{cases} 1 & (u,v) \in \mathcal{E} \\ 0 & \text{otherwise}. \end{cases}$ (8)

Note that now the adjacency and feature matrices $\mathbf{A}$ and $\mathbf{X}$ are "synchronised", in the sense that $a_{uv}$ specifies the adjacency information between the nodes described by the $u$th and $v$th rows of $\mathbf{X}$. Therefore, applying a permutation matrix $\mathbf{P}$ to the node features $\mathbf{X}$ automatically implies applying it to the rows and columns of $\mathbf{A}$, yielding $\mathbf{P}\mathbf{A}\mathbf{P}^\top$ (this is the representation of $\Sigma_n$ acting on matrices). We say that (a graph-wise function) $f$ is permutation invariant if

$f(\mathbf{P}\mathbf{X}, \mathbf{P}\mathbf{A}\mathbf{P}^\top) = f(\mathbf{X}, \mathbf{A})$ (9)

and (a node-wise function) $\mathbf{F}$ is permutation equivariant if

$\mathbf{F}(\mathbf{P}\mathbf{X}, \mathbf{P}\mathbf{A}\mathbf{P}^\top) = \mathbf{P}\mathbf{F}(\mathbf{X}, \mathbf{A})$ (10)

for any permutation matrix 𝐏𝐏{\bf P}bold_P.
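The crucial point in equations (9) and (10) is that the permutation must act on $\mathbf{X}$ and on both sides of $\mathbf{A}$ simultaneously. A minimal sketch, with illustrative (not canonical) choices of $\mathbf{F}$ and $f$:

```python
import numpy as np

# F aggregates neighbour features (equivariant); f sums F's rows (invariant).
rng = np.random.default_rng(2)
n, d = 5, 3
X = rng.normal(size=(n, d))
Adj = (rng.random((n, n)) < 0.4).astype(float)
Adj = np.maximum(Adj, Adj.T)                    # make the graph undirected

def F(X, A):
    return A @ X                                # node-wise neighbour sum

def f(X, A):
    return F(X, A).sum(axis=0)                  # graph-wise readout

P = np.eye(n)[rng.permutation(n)]
# Equation (10): F(PX, PAPᵀ) = P F(X, A)
assert np.allclose(F(P @ X, P @ Adj @ P.T), P @ F(X, Adj))
# Equation (9): f(PX, PAPᵀ) = f(X, A)
assert np.allclose(f(P @ X, P @ Adj @ P.T), f(X, Adj))
```

Permuting $\mathbf{X}$ alone, without conjugating $\mathbf{A}$, would break the synchronisation between features and connectivity and neither identity would hold.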

Here again, we can first characterise linear equivariant functions. As observed by Maron et al. (2018), any linear $\mathbf{F}$ satisfying equation (10) can be expressed as a linear combination of fifteen linear generators; remarkably, this family of generators is independent of $n$. (Fifteen is the Bell number $B_4$, which counts the number of ways to partition a set of $4$ elements, here given by the indices $(u,v), (u',v')$ of a linear map acting on the adjacency matrix.) Amongst these generators, our blueprint specifically advocates for those that are also local, i.e., those for which the output at node $u$ directly depends on its neighbouring nodes in the graph. We can formalise this constraint explicitly in our model construction, by defining what it means for a node to be neighbouring another.

An (undirected) neighbourhood of node $u$, sometimes also called the 1-hop neighbourhood, is defined as (often, the node $u$ itself is included in its own neighbourhood)

$\mathcal{N}_u = \{v : (u,v) \in \mathcal{E} \ \mathrm{or}\ (v,u) \in \mathcal{E}\}$ (11)

and the neighbourhood features as the multiset

$\mathbf{X}_{\mathcal{N}_u} = \{\!\!\{\mathbf{x}_v : v \in \mathcal{N}_u\}\!\!\}$. (12)

Operating on 1-hop neighbourhoods aligns well with the locality aspect of our blueprint: namely, defining our metric over graphs as the shortest path distance between nodes using edges in $\mathcal{E}$.

The GDL blueprint thus yields a general recipe for constructing permutation equivariant functions on graphs, by specifying a local function $\phi$ that operates on the features of a node and its neighbourhood, $\phi(\mathbf{x}_u, \mathbf{X}_{\mathcal{N}_u})$. A permutation equivariant function $\mathbf{F}$ can then be constructed by applying $\phi$ to every node's neighbourhood in isolation (see Figure 10):

$\mathbf{F}(\mathbf{X}, \mathbf{A}) = \begin{bmatrix} \phi(\mathbf{x}_1, \mathbf{X}_{\mathcal{N}_1}) \\ \phi(\mathbf{x}_2, \mathbf{X}_{\mathcal{N}_2}) \\ \vdots \\ \phi(\mathbf{x}_n, \mathbf{X}_{\mathcal{N}_n}) \end{bmatrix}$ (13)

As $\mathbf{F}$ is constructed by applying a shared function $\phi$ to each node locally, its permutation equivariance rests on $\phi$'s output being independent of the ordering of the nodes in $\mathcal{N}_u$. Thus, if $\phi$ is built to be permutation invariant, this property is satisfied. As we will see in future work, the choice of $\phi$ plays a crucial role in the expressive power of such a scheme. When $\phi$ is injective, the scheme is equivalent to one step of the Weisfeiler-Lehman graph isomorphism test, a classical algorithm in graph theory that provides a necessary condition for two graphs to be isomorphic via an iterative colour refinement procedure.
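The recipe of equation (13) can be sketched directly: below, $\phi$ concatenates a node's own features with the sum of its neighbours' features, a simple permutation-invariant choice made purely for illustration.

```python
import numpy as np

# Equation (13): apply a shared, permutation-invariant phi to every
# neighbourhood (here: phi(x_u, X_N_u) = [x_u ; sum of X_N_u]).
def F(X, A):
    n = X.shape[0]
    rows = []
    for u in range(n):
        neigh = [X[v] for v in range(n) if A[u, v] or A[v, u]]   # multiset
        agg = np.sum(neigh, axis=0) if neigh else np.zeros_like(X[u])
        rows.append(np.concatenate([X[u], agg]))                 # phi output
    return np.stack(rows)

rng = np.random.default_rng(3)
n, d = 6, 2
X = rng.normal(size=(n, d))
Adj = (rng.random((n, n)) < 0.3).astype(float)

# Because phi is permutation invariant over each neighbourhood,
# F is permutation equivariant over the whole graph.
P = np.eye(n)[rng.permutation(n)]
assert np.allclose(F(P @ X, P @ Adj @ P.T), P @ F(X, Adj))
```

Replacing the sum by any order-dependent aggregation (say, concatenating neighbours in index order) would break the assertion, which is exactly why $\phi$ must be permutation invariant.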

Figure 10: An illustration of constructing permutation-equivariant functions over graphs, by applying a permutation-invariant function $\phi$ to every neighbourhood. In this case, $\phi$ is applied to the features $\mathbf{x}_b$ of node $b$ as well as the multiset of its neighbourhood features, $\mathbf{X}_{\mathcal{N}_b} = \{\!\!\{\mathbf{x}_a, \mathbf{x}_b, \mathbf{x}_c, \mathbf{x}_d, \mathbf{x}_e\}\!\!\}$. Applying $\phi$ in this manner to every node's neighbourhood recovers the rows of the resulting matrix of latent features $\mathbf{H} = \mathbf{F}(\mathbf{X}, \mathbf{A})$.

It is also worth noticing that the difference between functions defined on sets and on more general graphs in this example is that in the latter case we need to explicitly account for the structure of the domain. As a consequence, graphs stand apart in the sense that the domain becomes part of the input in machine learning problems, whereas when dealing with sets and grids (both particular cases of graphs) we can specify only the features and assume the domain to be fixed. This distinction will be a recurring motif in our discussion. As a result, the notion of geometric stability (invariance to domain deformation) is crucial in most problems of learning on graphs. It straightforwardly follows from our construction that permutation invariant and equivariant functions produce identical outputs on isomorphic (topologically-equivalent) graphs. These results can be generalised to approximately isomorphic graphs, and several results on stability under graph perturbations exist (Levie et al., 2018). We will return to this important point in our discussion on manifolds, which we will use as a vehicle to study such invariance in further detail.

Second, due to their additional structure, graphs and grids, unlike sets, can be coarsened in a non-trivial way (more precisely, we cannot define a non-trivial coarsening assuming set structure alone; there exist established approaches that infer topological structure from unordered sets, and those can admit non-trivial coarsening), giving rise to a variety of pooling operations.

4.2 Grids and Euclidean spaces

The second type of objects we consider are grids. It is fair to say that the impact of deep learning was particularly dramatic in computer vision, natural language processing, and speech recognition. These applications all share a geometric common denominator: an underlying grid structure. As already mentioned, grids are a particular case of graphs with special adjacency. However, since the order of nodes in a grid is fixed, machine learning models for signals defined on grids are no longer required to account for permutation invariance, and have a stronger geometric prior: translation invariance.

Circulant matrices and Convolutions

Let us dwell on this point in more detail. Assuming for simplicity periodic boundary conditions, we can think of a one-dimensional grid as a ring graph with nodes indexed by $0, 1, \ldots, n-1$ modulo $n$ (which we will omit for brevity of notation) and an adjacency matrix with elements $a_{u, u+1 \bmod n} = 1$ and zero otherwise. There are two main differences from the general graph case we have discussed before. First, each node $u$ has identical connectivity, to its neighbours $u-1$ and $u+1$, and is thus structure-wise indistinguishable from the others. (As we will see later, this makes the grid a homogeneous space.) Second, and more importantly, since the nodes of the grid have a fixed ordering, we also have a fixed ordering of the neighbours: we can call $u-1$ the 'left neighbour' and $u+1$ the 'right neighbour'. If we use our previous recipe for designing an equivariant function $\mathbf{F}$ using a local aggregation function $\phi$, we now have $\mathbf{f}(\mathbf{x}_u) = \phi(\mathbf{x}_{u-1}, \mathbf{x}_u, \mathbf{x}_{u+1})$ at every node of the grid: $\phi$ no longer needs to be permutation invariant.
For a particular choice of a linear transformation $\phi(\mathbf{x}_{u-1}, \mathbf{x}_u, \mathbf{x}_{u+1}) = \theta_{-1}\mathbf{x}_{u-1} + \theta_0\mathbf{x}_u + \theta_1\mathbf{x}_{u+1}$, we can write $\mathbf{F}(\mathbf{X})$ as a matrix product,

$\mathbf{F}(\mathbf{X}) = \begin{bmatrix} \theta_0 & \theta_1 & & & \theta_{-1} \\ \theta_{-1} & \theta_0 & \theta_1 & & \\ & \ddots & \ddots & \ddots & \\ & & \theta_{-1} & \theta_0 & \theta_1 \\ \theta_1 & & & \theta_{-1} & \theta_0 \end{bmatrix} \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_{n-2} \\ \mathbf{x}_{n-1} \end{bmatrix}$

Note the very special multi-diagonal structure, with a single parameter repeated along each diagonal, sometimes referred to as "weight sharing" in the machine learning literature.
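This weight-sharing structure is easy to see in code. The following NumPy sketch (the three-tap filter values and all names are our own illustration, not fixed by the text) builds the circulant matrix above and checks that multiplying by it applies the same local map $\phi$ at every position:

```python
import numpy as np

n, d = 6, 3                            # sequence length, feature dimension
theta = {-1: 0.25, 0: 0.5, 1: 0.25}    # shared weights theta_{-1}, theta_0, theta_1

# Build the multi-diagonal weight matrix: one parameter repeated along each
# diagonal, with circular wrap-around (periodic boundary conditions).
C = np.zeros((n, n))
for offset, w in theta.items():
    for u in range(n):
        C[u, (u + offset) % n] = w

X = np.random.randn(n, d)              # signal: one d-dimensional feature per point

# F(X) = C X applies the same local linear map phi at every position u.
FX = C @ X
u = 2
phi_u = theta[-1] * X[u - 1] + theta[0] * X[u] + theta[1] * X[u + 1]
assert np.allclose(FX[u], phi_u)
```

Only three scalars parametrise the whole $n \times n$ matrix, which is exactly the "weight sharing" referred to above.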

More generally, given a vector $\bm{\theta}=(\theta_{0},\ldots,\theta_{n-1})$, a circulant matrix $\mathbf{C}(\bm{\theta})=(\theta_{u-v\,\mathrm{mod}\,n})$ is obtained by stacking circularly shifted versions of the vector $\bm{\theta}$. Circulant matrices are synonymous with discrete convolutions,\footnote{Because of the periodic boundary conditions, this is a circular or cyclic convolution. In signal processing, $\bm{\theta}$ is often referred to as the "filter," and in CNNs, its coefficients are learnable.}

\[
(\mathbf{x}\star\bm{\theta})_{u}=\sum_{v=0}^{n-1}x_{v\,\mathrm{mod}\,n}\,\theta_{u-v\,\mathrm{mod}\,n},
\]

as one has $\mathbf{C}(\bm{\theta})\mathbf{x}=\mathbf{x}\star\bm{\theta}$. The particular choice of $\bm{\theta}=(0,1,0,\ldots,0)^{\top}$ yields a special circulant matrix that shifts vectors one position to the right. This matrix is called the (right) shift or translation operator and denoted by $\mathbf{S}$.\footnote{The left shift operator is given by $\mathbf{S}^{\top}$. Obviously, shifting left and then right (or vice versa) does nothing, which means $\mathbf{S}$ is orthogonal: $\mathbf{S}^{\top}\mathbf{S}=\mathbf{S}\mathbf{S}^{\top}=\mathbf{I}$.}
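A minimal NumPy sketch of the shift operator (our own illustration) makes its action and orthogonality concrete:

```python
import numpy as np

n = 5
# Right-shift operator: the circulant matrix C(theta) for theta = (0,1,0,...,0);
# row u of S picks out entry u-1 of the input (with wrap-around).
S = np.roll(np.eye(n), 1, axis=0)

x = np.arange(n, dtype=float)     # x = [0, 1, 2, 3, 4]
print(S @ x)                      # -> [4. 0. 1. 2. 3.]: shifted right by one

# Shifting left then right (or vice versa) is the identity, so S is orthogonal.
assert np.allclose(S.T @ S, np.eye(n))
assert np.allclose(S @ S.T, np.eye(n))
```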

Circulant matrices can be characterised by their commutativity property: the product of circulant matrices is commutative, i.e. $\mathbf{C}(\bm{\theta})\mathbf{C}(\bm{\eta})=\mathbf{C}(\bm{\eta})\mathbf{C}(\bm{\theta})$ for any $\bm{\theta}$ and $\bm{\eta}$. Since the shift is itself a circulant matrix, we recover the familiar translation or shift equivariance of the convolution operator,

\[
\mathbf{S}\mathbf{C}(\bm{\theta})\mathbf{x}=\mathbf{C}(\bm{\theta})\mathbf{S}\mathbf{x}.
\]

This commutativity property should not be surprising, since the underlying symmetry group (the translation group) is Abelian. Moreover, the converse holds as well: a matrix is circulant if and only if it commutes with the shift. This, in turn, allows us to define convolution as a translation-equivariant linear operation, and is a nice illustration of the power of geometric priors and of the overall philosophy of Geometric ML: convolution emerges from the first principle of translational symmetry.
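Both directions of this characterisation can be checked numerically. In the sketch below (our own construction; the group-averaging trick is one standard way to manufacture a shift-commuting matrix), a random circulant commutes with the shift, and symmetrising an arbitrary matrix over conjugation by powers of $\mathbf{S}$ yields a matrix that both commutes with the shift and is circulant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

def circulant(theta):
    """C(theta)[u, v] = theta[(u - v) mod n]."""
    m = len(theta)
    u, v = np.indices((m, m))
    return theta[(u - v) % m]

S = circulant(np.eye(n)[1])            # shift operator: theta = (0,1,0,...,0)
C = circulant(rng.standard_normal(n))  # a random circulant (= random convolution)

# Circulant matrices commute with the shift (translation equivariance) ...
assert np.allclose(S @ C, C @ S)

# ... and conversely: averaging a generic matrix over conjugation by the
# cyclic group yields a matrix commuting with S, which turns out circulant.
A = rng.standard_normal((n, n))
P = sum(np.linalg.matrix_power(S, k) @ A @ np.linalg.matrix_power(S.T, k)
        for k in range(n)) / n
assert np.allclose(S @ P, P @ S)             # P commutes with the shift ...
assert np.allclose(P, circulant(P[:, 0]))    # ... and is circulant
```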

Note that, unlike the situation on sets and graphs, the number of linearly independent shift-equivariant linear functions (convolutions) grows with the size of the domain (we have one degree of freedom per diagonal of a circulant matrix). However, the scale separation prior allows filters to be local, resulting in the same $\Theta(1)$ parameter complexity per layer, as we will see in Section 5.1 when discussing how these principles are used in the implementation of Convolutional Neural Network architectures.

Derivation of the discrete Fourier transform

We have already mentioned the Fourier transform and its connection to convolution: the fact that the Fourier transform diagonalises the convolution operation is an important property used in signal processing to perform convolution in the frequency domain as an element-wise product of Fourier transforms. However, textbooks usually only state this fact, rarely explaining where the Fourier transform comes from and what is so special about the Fourier basis. Here we can derive it, demonstrating once more how foundational the basic principles of symmetry are.

For this purpose, recall a fact from linear algebra: (diagonalisable) matrices are jointly diagonalisable iff they mutually commute.\footnote{We must additionally assume distinct eigenvalues, otherwise there might be multiple possible diagonalisations. This assumption is satisfied by our choice of $\mathbf{S}$.} In other words, there exists a common eigenbasis for all circulant matrices, in which they differ only by their eigenvalues. We can therefore pick one circulant matrix and compute its eigenvectors; we are assured that these will be the eigenvectors of all other circulant matrices as well. It is convenient to pick the shift operator, whose eigenvectors happen to be the discrete Fourier basis\footnote{$\mathbf{S}$ is orthogonal but non-symmetric; hence, its eigenvectors are orthogonal but its eigenvalues are complex (roots of unity).}

\[
\bm{\varphi}_{k}=\frac{1}{\sqrt{n}}\left(1,e^{\frac{2\pi\mathrm{i}k}{n}},e^{\frac{4\pi\mathrm{i}k}{n}},\ldots,e^{\frac{2\pi\mathrm{i}(n-1)k}{n}}\right)^{\top},\qquad k=0,1,\ldots,n-1,
\]

which we can arrange into an $n\times n$ Fourier matrix $\bm{\Phi}=(\bm{\varphi}_{0},\ldots,\bm{\varphi}_{n-1})$. Multiplication by $\bm{\Phi}^{*}$\footnote{Note that the eigenvectors are complex, so we need to take the complex conjugate when transposing $\bm{\Phi}$.} gives the Discrete Fourier Transform (DFT), and multiplication by $\bm{\Phi}$ the inverse DFT,

\[
\hat{x}_{k}=\frac{1}{\sqrt{n}}\sum_{u=0}^{n-1}x_{u}e^{-\frac{2\pi\mathrm{i}ku}{n}},\qquad x_{u}=\frac{1}{\sqrt{n}}\sum_{k=0}^{n-1}\hat{x}_{k}e^{+\frac{2\pi\mathrm{i}ku}{n}}.
\]
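As a sanity check, the following NumPy sketch (our own illustration; the `norm="ortho"` option matches the $1/\sqrt{n}$ convention above) verifies that the matrix $\bm{\Phi}$ built from these vectors diagonalises the shift, and that multiplication by $\bm{\Phi}^{*}$ agrees with a standard FFT routine:

```python
import numpy as np

n = 8
u, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
# Fourier matrix: column k is the eigenvector phi_k of the shift operator.
Phi = np.exp(2j * np.pi * k * u / n) / np.sqrt(n)

# Phi diagonalises the shift: S phi_k = e^{-2 pi i k / n} phi_k.
S = np.roll(np.eye(n), 1, axis=0)
eigvals = np.exp(-2j * np.pi * np.arange(n) / n)
assert np.allclose(S @ Phi, Phi * eigvals)

# Multiplication by Phi* is the DFT (matching NumPy's unitary convention),
# and multiplication by Phi inverts it.
x = np.random.randn(n)
assert np.allclose(Phi.conj().T @ x, np.fft.fft(x, norm="ortho"))
assert np.allclose(Phi @ (Phi.conj().T @ x), x)
```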

Since all circulant matrices are jointly diagonalisable,\footnote{Since the Fourier matrix is unitary ($\bm{\Phi}^{*}\bm{\Phi}=\mathbf{I}$), geometrically it acts as a change of the system of coordinates amounting to an $n$-dimensional rotation. In this system of coordinates (the "Fourier domain"), the action of a circulant matrix $\mathbf{C}$ becomes an element-wise product.} they are all diagonalised by the Fourier transform and differ only in their eigenvalues. Since the eigenvalues of the circulant matrix $\mathbf{C}(\bm{\theta})$ are the Fourier transform of the filter (see e.g. Bamieh (2018)), $\hat{\bm{\theta}}=\bm{\Phi}^{*}\bm{\theta}$, we obtain the Convolution Theorem:

\[
\mathbf{C}(\bm{\theta})\mathbf{x}=\bm{\Phi}\left[\begin{array}{ccc}
\hat{\theta}_{0} & & \\
 & \ddots & \\
 & & \hat{\theta}_{n-1}
\end{array}\right]\bm{\Phi}^{*}\mathbf{x}=\bm{\Phi}(\hat{\bm{\theta}}\odot\hat{\mathbf{x}})
\]

Because the Fourier matrix $\bm{\Phi}$ has a special algebraic structure, the products $\bm{\Phi}^{*}\mathbf{x}$ and $\bm{\Phi}\mathbf{x}$ can be computed with $\mathcal{O}(n\log n)$ complexity using a Fast Fourier Transform (FFT) algorithm. This is one of the reasons why frequency-domain filtering is so popular in signal processing; furthermore, the filter is typically designed directly in the frequency domain, so the Fourier transform $\hat{\bm{\theta}}$ never needs to be computed explicitly.
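The Convolution Theorem can be verified directly. In this NumPy sketch (our own illustration; NumPy's unnormalised FFT convention cancels in the product-and-invert round trip), the spatial-domain circulant product and the frequency-domain element-wise product coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
theta, x = rng.standard_normal(n), rng.standard_normal(n)

def circulant(theta):
    """C(theta)[u, v] = theta[(u - v) mod n]."""
    m = len(theta)
    u, v = np.indices((m, m))
    return theta[(u - v) % m]

# Spatial domain: matrix product with C(theta), O(n^2) if done naively ...
y_spatial = circulant(theta) @ x

# ... versus the Fourier domain: element-wise product of DFTs, O(n log n).
y_fourier = np.fft.ifft(np.fft.fft(theta) * np.fft.fft(x)).real
assert np.allclose(y_spatial, y_fourier)
```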

Besides the didactic value of the derivation of the Fourier transform and convolution we have carried out here, it provides a scheme to generalise these concepts to graphs. Realising that the adjacency matrix of the ring graph is exactly the shift operator, one can develop the graph Fourier transform and an analogue of the convolution operator by computing the eigenvectors of the adjacency matrix (see e.g. Sandryhaila and Moura (2013)). Early attempts to develop graph neural networks by analogy to CNNs, sometimes termed 'spectral GNNs', exploited exactly this blueprint.\footnote{In graph signal processing, the eigenvectors of the graph Laplacian are often used as an alternative to the adjacency matrix to construct the graph Fourier transform, see Shuman et al. (2013). On grids, the two matrices have joint eigenvectors, but on graphs they result in somewhat different, though related, constructions.} We will see in Sections 4.4–4.6 that this analogy has some important limitations. The first limitation comes from the fact that a grid is fixed, and hence all signals on it can be represented in the same Fourier basis. In contrast, on general graphs the Fourier basis depends on the structure of the graph; hence, we cannot directly compare Fourier transforms on two different graphs — a problem that translates into a lack of generalisation in machine learning problems. Secondly, multi-dimensional grids, which are constructed as tensor products of one-dimensional grids, retain the underlying structure: the Fourier basis elements and the corresponding frequencies (eigenvalues) can be organised in multiple dimensions. In images, for example, we can naturally talk about horizontal and vertical frequencies, and filters have a notion of direction. On graphs, the structure of the Fourier domain is one-dimensional, as we can only organise the Fourier basis functions by the magnitude of the corresponding frequencies. As a result, graph filters are oblivious to direction, i.e., isotropic.
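The blueprint can be illustrated on the ring graph itself (a NumPy sketch of our own): the directed ring's adjacency matrix is the shift, and the DFT basis simultaneously diagonalises the shift, the undirected ring's adjacency matrix, and its graph Laplacian, whose eigenvalues $2-2\cos(2\pi k/n)$ play the role of frequencies:

```python
import numpy as np

n = 8
S = np.roll(np.eye(n), 1, axis=0)   # directed ring graph: adjacency = shift
A = S + S.T                         # undirected ring: each node has 2 neighbours
L = 2 * np.eye(n) - A               # its graph Laplacian

u, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
Phi = np.exp(2j * np.pi * k * u / n) / np.sqrt(n)   # DFT basis

# The Fourier basis jointly diagonalises shift, adjacency, and Laplacian.
for M in (S, A, L):
    D = Phi.conj().T @ M @ Phi
    assert np.allclose(D, np.diag(np.diag(D)))

# Laplacian eigenvalues 2 - 2 cos(2 pi k / n) order the basis by frequency.
freqs = np.real(np.diag(Phi.conj().T @ L @ Phi))
assert np.allclose(freqs, 2 - 2 * np.cos(2 * np.pi * np.arange(n) / n))
```

On a general graph one would replace `A` (or `L`) by the graph's own matrix and recompute the eigenvectors, which is precisely why the basis no longer transfers between graphs.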

Derivation of the continuous Fourier transform

For the sake of completeness, and as a segue into the next discussion, we repeat our analysis in the continuous setting. As in Section 3.4, consider functions defined on $\Omega=\mathbb{R}$ and the translation operator $(S_{v}f)(u)=f(u-v)$, shifting $f$ by some amount $v$. Applying $S_{v}$ to the Fourier basis functions $\varphi_{\xi}(u)=e^{\mathrm{i}\xi u}$ yields, by the multiplicative property of the exponential,

\[
S_{v}e^{\mathrm{i}\xi u}=e^{\mathrm{i}\xi(u-v)}=e^{-\mathrm{i}\xi v}e^{\mathrm{i}\xi u},
\]

i.e., $\varphi_{\xi}(u)$ is a complex eigenvector of $S_{v}$ with the complex eigenvalue $e^{-\mathrm{i}\xi v}$, exactly mirroring the situation we had in the discrete setting. Since $S_{v}$ is a unitary operator (i.e., $\|S_{v}x\|_{p}=\|x\|_{p}$ for any $p$ and $x\in L_{p}(\mathbb{R})$), any eigenvalue $\lambda$ must satisfy $|\lambda|=1$, which corresponds precisely to the eigenvalues $e^{-\mathrm{i}\xi v}$ found above. Moreover, the spectrum of the translation operator is simple, meaning that two eigenfunctions sharing the same eigenvalue must necessarily be collinear. Indeed, suppose that $S_{v}f=e^{-\mathrm{i}\xi_{0}v}f$ for some $\xi_{0}$. Taking the Fourier transform of both sides, we obtain

\[
\forall\xi,\quad e^{-\mathrm{i}\xi v}\hat{f}(\xi)=e^{-\mathrm{i}\xi_{0}v}\hat{f}(\xi),
\]

which implies that $\hat{f}(\xi)=0$ for $\xi\neq\xi_{0}$, thus $f=\alpha\varphi_{\xi_{0}}$.

For a general linear operator $C$ that is translation equivariant ($S_{v}C=CS_{v}$), we have

\[
S_{v}Ce^{\mathrm{i}\xi u}=CS_{v}e^{\mathrm{i}\xi u}=e^{-\mathrm{i}\xi v}Ce^{\mathrm{i}\xi u},
\]

implying that $Ce^{\mathrm{i}\xi u}$ is also an eigenfunction\footnote{'Eigenfunction' is synonymous with 'eigenvector' and is used when referring to eigenvectors of continuous operators.} of $S_{v}$ with eigenvalue $e^{-\mathrm{i}\xi v}$, from which it follows, by the simplicity of the spectrum, that $Ce^{\mathrm{i}\xi u}=\beta\varphi_{\xi}(u)$; in other words, the Fourier basis is the eigenbasis of all translation equivariant operators. As a result, $C$ is diagonal in the Fourier domain and can be expressed as $Ce^{\mathrm{i}\xi u}=\hat{p}_{C}(\xi)e^{\mathrm{i}\xi u}$, where $\hat{p}_{C}(\xi)$ is a transfer function acting on the different frequencies $\xi$. Finally, for an arbitrary function $x(u)$, by linearity,

\begin{align*}
(Cx)(u) &= C\int_{-\infty}^{+\infty}\hat{x}(\xi)e^{\mathrm{i}\xi u}\,\mathrm{d}\xi=\int_{-\infty}^{+\infty}\hat{x}(\xi)\hat{p}_{C}(\xi)e^{\mathrm{i}\xi u}\,\mathrm{d}\xi\\
&= \int_{-\infty}^{+\infty}p_{C}(v)x(u-v)\,\mathrm{d}v=(x\star p_{C})(u),
\end{align*}

where $p_{C}(u)$ is the inverse Fourier transform of $\hat{p}_{C}(\xi)$. It thus follows that every linear translation equivariant operator is a convolution.

4.3 Groups and Homogeneous spaces

Our discussion of grids highlighted how shifts and convolutions are intimately connected: convolutions are linear shift-equivariant operations, and vice versa, any shift-equivariant linear operator is a convolution. Furthermore, shift operators can be jointly diagonalised by the Fourier transform. As it turns out, this is part of a far larger story: both the convolution and the Fourier transform can be defined for any group of symmetries that we can sum or integrate over.\footnote{Technically, we need the group to be locally compact, so that there exists a left-invariant Haar measure. Integrating with respect to this measure, we can "shift" the integrand by any group element and obtain the same result, just as $\int_{-\infty}^{+\infty}x(u)\mathrm{d}u=\int_{-\infty}^{+\infty}x(u-v)\mathrm{d}u$ for functions $x:\mathbb{R}\rightarrow\mathbb{R}$.}

Consider the Euclidean domain $\Omega=\mathbb{R}$. We can understand the convolution as a pattern matching operation: we match shifted copies of a filter $\theta(u)$ against an input signal $x(u)$. The value of the convolution $(x\star\theta)(u)$ at a point $u$ is the inner product of the signal $x$ with the filter shifted by $u$,

\[
(x\star\theta)(u)=\langle x,S_{u}\theta\rangle=\int_{\mathbb{R}}x(v)\theta(u+v)\,\mathrm{d}v.
\]

Note that in this case $u$ is both a point on the domain $\Omega=\mathbb{R}$ and an element of the translation group, which we can identify with the domain itself, ${\mathfrak{G}}=\mathbb{R}$. We will now show how to generalise this construction by simply replacing the translation group with another group ${\mathfrak{G}}$ acting on $\Omega$.

Group convolution

As discussed in Section 3, the action of the group ${\mathfrak{G}}$ on the domain $\Omega$ induces a representation $\rho$ of ${\mathfrak{G}}$ on the space of signals $\mathcal{X}(\Omega)$ via $(\rho({\mathfrak{g}})x)(u)=x({\mathfrak{g}}^{-1}u)$. In the above example, ${\mathfrak{G}}$ is the translation group, whose elements act by shifting the coordinates, $u+v$, whereas $\rho({\mathfrak{g}})$ is the shift operator acting on signals as $(S_{v}x)(u)=x(u-v)$. Finally, in order to match a filter against the signal, we invoke our assumption that $\mathcal{X}(\Omega)$ is a Hilbert space, with an inner product

\[
\langle x,\theta\rangle=\int_{\Omega}x(u)\theta(u)\,\mathrm{d}u,
\]

where we assumed, for the sake of simplicity, scalar-valued signals $\mathcal{X}(\Omega,\mathbb{R})$; in general, the inner product has the form of equation (2).

Having thus defined how to transform signals and match them with filters, we can define the group convolution for signals on $\Omega$,

\[
(x\star\theta)({\mathfrak{g}})=\langle x,\rho({\mathfrak{g}})\theta\rangle=\int_{\Omega}x(u)\theta({\mathfrak{g}}^{-1}u)\,\mathrm{d}u.\tag{14}
\]

Note that $x\star\theta$ takes values on the elements ${\mathfrak{g}}$ of our group ${\mathfrak{G}}$ rather than on points of the domain $\Omega$. Hence, the next layer, which takes $x\star\theta$ as input, should act on signals defined on the group ${\mathfrak{G}}$, a point we will return to shortly.

Just as the traditional Euclidean convolution is shift-equivariant, the more general group convolution is ${\mathfrak{G}}$-equivariant. The key observation is that matching the signal $x$ with a ${\mathfrak{g}}$-transformed filter $\rho({\mathfrak{g}})\theta$ is the same as matching the inversely transformed signal $\rho({\mathfrak{g}}^{-1})x$ with the untransformed filter $\theta$. Mathematically, this can be expressed as $\langle x,\rho({\mathfrak{g}})\theta\rangle=\langle\rho({\mathfrak{g}}^{-1})x,\theta\rangle$. With this insight, the ${\mathfrak{G}}$-equivariance of the group convolution (14) follows immediately from its definition and the defining property $\rho({\mathfrak{h}}^{-1})\rho({\mathfrak{g}})=\rho({\mathfrak{h}}^{-1}{\mathfrak{g}})$ of group representations,

\[
(\rho({\mathfrak{h}})x\star\theta)({\mathfrak{g}})=\langle\rho({\mathfrak{h}})x,\rho({\mathfrak{g}})\theta\rangle=\langle x,\rho({\mathfrak{h}}^{-1}{\mathfrak{g}})\theta\rangle=\rho({\mathfrak{h}})(x\star\theta)({\mathfrak{g}}).
\]

Let us look at some examples. The case of the one-dimensional grid we have studied above is obtained with the choice $\Omega=\mathbb{Z}_{n}=\{0,\ldots,n-1\}$ and the cyclic shift group ${\mathfrak{G}}=\mathbb{Z}_{n}$. The group elements in this case are cyclic shifts of indices, i.e., an element ${\mathfrak{g}}\in{\mathfrak{G}}$ can be identified with some $u=0,\ldots,n-1$ such that ${\mathfrak{g}}.v=v-u\,\mathrm{mod}\,n$, whereas the inverse element acts as ${\mathfrak{g}}^{-1}.v=v+u\,\mathrm{mod}\,n$. Importantly, in this example the elements of the group (shifts) are also elements of the domain (indices). We can thus, with some abuse of notation, identify the two structures (i.e., $\Omega={\mathfrak{G}}$); our expression for the group convolution in this case,

\[
(x\star\theta)({\mathfrak{g}})=\sum_{v=0}^{n-1}x_{v}\,\theta_{{\mathfrak{g}}^{-1}v},
\]

leads to the familiar convolution\footnote{Again, strictly speaking, this is a cross-correlation.} $(x\star\theta)_{u}=\sum_{v=0}^{n-1}x_{v}\,\theta_{v+u\,\mathrm{mod}\,n}$.
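Instantiating the group convolution for ${\mathfrak{G}}=\mathbb{Z}_{n}$ is a one-liner. The NumPy sketch below (our own illustration; the Fourier-domain check uses the standard cyclic cross-correlation identity) evaluates $(x\star\theta)_{u}=\sum_{v}x_{v}\theta_{v+u\,\mathrm{mod}\,n}$ directly from the group action:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x, theta = rng.standard_normal(n), rng.standard_normal(n)

# Group convolution on Z_n: for each group element g (identified with u),
# match x against the g-transformed filter theta_{g^{-1} v} = theta_{(v+u) mod n}.
y = np.array([sum(x[v] * theta[(v + u) % n] for v in range(n))
              for u in range(n)])

# The same cyclic cross-correlation, computed in the Fourier domain.
y_fft = np.fft.ifft(np.fft.fft(theta) * np.fft.fft(x).conj()).real
assert np.allclose(y, y_fft)
```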

Spherical convolution

Now consider\footnote{Cosmic microwave background radiation, captured by the Planck space observatory, is a signal on $\mathbb{S}^2$.} the two-dimensional sphere $\Omega=\mathbb{S}^2$ with the group of rotations, the special orthogonal group $\mathfrak{G}=\mathrm{SO}(3)$. While chosen for pedagogical reasons, this example is actually very practical and arises in numerous applications. In astrophysics, for example, observational data often naturally has spherical geometry. Furthermore, spherical symmetries are very important in chemistry applications when modelling molecules and trying to predict their properties, e.g. for the purpose of virtual drug screening.

Representing a point on the sphere as a three-dimensional unit vector $\mathbf{u}$, $\|\mathbf{u}\|=1$, the action of the group can be represented as a $3\times 3$ orthogonal matrix $\mathbf{R}$ with $\det(\mathbf{R})=1$. The spherical convolution can thus be written as the inner product between the signal and the rotated filter,

$$(x\star\theta)(\mathbf{R}) = \int_{\mathbb{S}^2} x(\mathbf{u})\,\theta(\mathbf{R}^{-1}\mathbf{u})\,\mathrm{d}\mathbf{u}.$$

The first thing to note is that now the group is not identical to the domain: the group $\mathrm{SO}(3)$ is a Lie group that is in fact a three-dimensional manifold, whereas $\mathbb{S}^2$ is only two-dimensional. Consequently, in this case, unlike the previous example, the convolution is a function on $\mathrm{SO}(3)$ rather than on $\Omega$.
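The integral above can be approximated by Monte Carlo sampling on the sphere, which also lets us check rotation equivariance numerically. The sketch below is our own illustration (the signals, sample size, and tolerance are arbitrary assumptions, not from the text); it estimates the correlation up to the constant surface area $4\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation():
    # sample a rotation in SO(3): QR-decompose a Gaussian matrix,
    # fix column signs, and flip a column if the determinant is -1
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

# uniform Monte Carlo samples on the sphere (normalised Gaussian vectors)
U = rng.normal(size=(200_000, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)

x = lambda V: np.exp(V @ np.array([0.0, 0.0, 1.0]))   # signal peaked at the north pole
theta = lambda V: V @ np.array([1.0, 0.0, 0.0])       # a simple filter

def spherical_corr(x, theta, R):
    # (x ⋆ θ)(R) ∝ mean_i x(u_i) θ(R^{-1} u_i); the rows of U @ R are R^{-1} u_i
    return np.mean(x(U) * theta(U @ R))

R, R2 = random_rotation(), random_rotation()
# equivariance: correlating the rotated signal (ρ(R2)x)(u) = x(R2^{-1} u)
# evaluates the original correlation at R2^{-1} R
lhs = spherical_corr(lambda V: x(V @ R2), theta, R)
rhs = spherical_corr(x, theta, R2.T @ R)
assert abs(lhs - rhs) < 0.05   # equal up to Monte Carlo error
```

Observe that the output is indexed by a rotation matrix $\mathbf{R}$, not by a point of the sphere — precisely the point made in the text.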

This has important practical consequences: in our Geometric Deep Learning blueprint, we concatenate multiple equivariant maps (`layers' in deep learning jargon) by applying a subsequent operator to the output of the previous one. In the case of translations, we can apply multiple convolutions in sequence, since their outputs are all defined on the same domain $\Omega$. In the general setting, since $x\star\theta$ is a function on $\mathfrak{G}$ rather than on $\Omega$, we cannot use exactly the same operation subsequently: the next operation has to deal with signals on $\mathfrak{G}$, i.e. $x\in\mathcal{X}(\mathfrak{G})$. Our definition of group convolution allows this case: we take as domain $\Omega=\mathfrak{G}$, acted upon by $\mathfrak{G}$ itself via the group action $(\mathfrak{g},\mathfrak{h})\mapsto\mathfrak{g}\mathfrak{h}$ given by the composition operation of $\mathfrak{G}$. This yields the representation $\rho(\mathfrak{g})$ acting on $x\in\mathcal{X}(\mathfrak{G})$ by $(\rho(\mathfrak{g})x)(\mathfrak{h}) = x(\mathfrak{g}^{-1}\mathfrak{h})$.\footnote{The representation of $\mathfrak{G}$ acting on functions defined on $\mathfrak{G}$ itself is called the regular representation of $\mathfrak{G}$.} Just like before, the inner product is defined by integrating the point-wise product of the signal and the filter over the domain, which now equals $\Omega=\mathfrak{G}$.
In our example of spherical convolution, a second layer of convolution would thus have the form

$$((x\star\theta)\star\phi)(\mathbf{R}) = \int_{\mathrm{SO}(3)} (x\star\theta)(\mathbf{Q})\,\phi(\mathbf{R}^{-1}\mathbf{Q})\,\mathrm{d}\mathbf{Q}.$$
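The group-on-group case with $\Omega=\mathfrak{G}$ is easy to verify exactly for a small finite group. The sketch below is our own illustration using the symmetric group $S_3$ (nothing here is prescribed by the text): it implements $(x\star\theta)(\mathfrak{g})=\sum_{\mathfrak{h}}x(\mathfrak{h})\,\theta(\mathfrak{g}^{-1}\mathfrak{h})$ and checks equivariance under the regular representation:

```python
import numpy as np
from itertools import permutations

G = list(permutations(range(3)))        # the symmetric group S3 (6 elements)
idx = {g: i for i, g in enumerate(G)}

def compose(g, h):
    # composition of permutations: (g ∘ h)(i) = g[h[i]]
    return tuple(g[h[i]] for i in range(3))

def inverse(g):
    inv = [0, 0, 0]
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)

def group_corr(x, theta):
    # (x ⋆ θ)(g) = Σ_h x(h) θ(g⁻¹ h), itself a signal on the group
    return np.array([sum(x[idx[h]] * theta[idx[compose(inverse(g), h)]] for h in G)
                     for g in G])

def act(g, x):
    # regular representation: (ρ(g)x)(h) = x(g⁻¹ h)
    return np.array([x[idx[compose(inverse(g), h)]] for h in G])

rng = np.random.default_rng(0)
x, theta = rng.normal(size=6), rng.normal(size=6)
g0 = G[3]
# equivariance: ρ(g0)(x ⋆ θ) == (ρ(g0)x) ⋆ θ
assert np.allclose(act(g0, group_corr(x, theta)), group_corr(act(g0, x), theta))
```

Since $S_3$ is non-abelian, this also illustrates that the construction does not rely on commutativity of the group.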

Since convolution involves an inner product, which in turn requires integrating over the domain $\Omega$, we can only use it on domains $\Omega$ that are small (in the discrete case) or low-dimensional (in the continuous case). For instance, we can use convolutions on the plane $\mathbb{R}^2$ (two-dimensional) or the special orthogonal group $\mathrm{SO}(3)$ (three-dimensional), or on the finite set of nodes of a graph ($n$-dimensional), but we cannot in practice perform convolution on the group of permutations $\Sigma_n$, which has $n!$ elements. Likewise, integrating over higher-dimensional groups like the affine group (containing translations, rotations, shearing, and scaling, for a total of $6$ dimensions) is not feasible in practice. Nevertheless, as we have seen in Section 5.3, we can still build equivariant convolutions for large groups $\mathfrak{G}$ by working with signals defined on low-dimensional spaces $\Omega$ on which $\mathfrak{G}$ acts. Indeed, it is possible to show that any equivariant linear map $f:\mathcal{X}(\Omega)\rightarrow\mathcal{X}(\Omega')$ between two domains $\Omega,\Omega'$ can be written as a generalised convolution similar to the group convolution discussed here.

Second, we note that the Fourier transform we derived in the previous section from the shift-equivariance property of the convolution can also be extended to a more general case, by projecting the signal onto the matrix elements of irreducible representations of the symmetry group; we will discuss this in future work. In the case of $\mathrm{SO}(3)$ studied here, this gives rise to the spherical harmonics and Wigner D-functions, which find wide application in quantum mechanics and chemistry.

Finally, we point to the assumption that has so far underpinned our discussion in this section: whether $\Omega$ was a grid, a plane, or the sphere, we could transform every point into any other point, intuitively meaning that all points of the domain `look the same.' A domain $\Omega$ with this property is called a homogeneous space: for any $u,v\in\Omega$ there exists $\mathfrak{g}\in\mathfrak{G}$ such that $\mathfrak{g}.u = v$.\footnote{The additional properties $\mathfrak{e}.u = u$ and $\mathfrak{g}.(\mathfrak{h}.u) = (\mathfrak{g}\mathfrak{h}).u$ are tacitly assumed here.} In the next section we will try to relax this assumption.
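The homogeneity of the sphere can be made concrete: given any two points, a rotation carrying one to the other can be written down explicitly. The sketch below is our own illustration via Rodrigues' formula (valid whenever $\mathbf{u}\neq-\mathbf{v}$):

```python
import numpy as np

def rotation_taking(u, v):
    # a rotation R ∈ SO(3) with R u = v (u, v unit vectors, u ≠ -v),
    # built from Rodrigues' formula about the axis w = u × v:
    # R = I + [w]× + [w]×² / (1 + uᵀv)
    w = np.cross(u, v)
    K = np.array([[0, -w[2], w[1]],
                  [w[2], 0, -w[0]],
                  [-w[1], w[0], 0]])
    return np.eye(3) + K + K @ K / (1 + u @ v)

rng = np.random.default_rng(1)
u, v = rng.normal(size=3), rng.normal(size=3)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)

R = rotation_taking(u, v)
assert np.allclose(R @ u, v)               # the action is transitive
assert np.allclose(R.T @ R, np.eye(3))     # R is orthogonal...
assert np.isclose(np.linalg.det(R), 1.0)   # ...with det +1, i.e. R ∈ SO(3)
```

This transitivity is exactly the defining property of a homogeneous space.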

4.4 Geodesics and Manifolds

In our last example, the sphere $\mathbb{S}^2$\footnote{As well as the group of rotations $\mathrm{SO}(3)$, by virtue of it being a Lie group.} was a manifold, albeit a special one with a global symmetry group, owing to its homogeneous structure. Unfortunately, this is not the case for the majority of manifolds, which typically have no global symmetries. In this case, we cannot straightforwardly define an action of $\mathfrak{G}$ on the space of signals on $\Omega$ and use it to `slide' filters around in order to define a convolution as a direct generalisation of the classical construction. Nevertheless, manifolds do have two types of invariance that we will explore in this section: transformations preserving the metric structure, and changes of the local reference frame.

While to many machine learning readers manifolds might appear somewhat exotic objects, they are in fact very common in various scientific domains. In physics, manifolds play a central role as the model of our Universe: according to Einstein's General Theory of Relativity, gravity arises from the curvature of space-time, modelled as a pseudo-Riemannian manifold. In more `prosaic' fields such as computer graphics and vision, manifolds are a common mathematical model of 3D shapes.\footnote{The term `3D' is somewhat misleading and refers to the embedding space. The shapes themselves are 2D manifolds (surfaces).} The broad spectrum of applications of such models ranges from virtual and augmented reality and special effects obtained by means of `motion capture', to structural biology dealing with proteins that stick together (`bind' in chemical jargon) like pieces of a 3D puzzle. The common denominator of these applications is the use of a manifold to represent the boundary surface of some 3D object.

There are several reasons why such models are convenient.\footnote{The human body is an example of a non-rigid object deforming in a nearly-isometric way.} First, they offer a compact description of the 3D object, eliminating the need to allocate memory to `empty space' as is required in grid-based representations. Second, they allow one to ignore the internal structure of the object. This is a handy property, for example, in structural biology, where the internal folding of a protein molecule is often irrelevant for interactions that happen on the molecular surface. Third, and most importantly, one often needs to deal with deformable objects undergoing non-rigid deformations. Our own body is one such example, and many applications in computer graphics and vision, such as the aforementioned motion capture and virtual avatars, require invariance to such deformations. These deformations can be modelled very well as transformations that preserve the intrinsic structure of a (Riemannian) manifold, namely the distances between points measured along the manifold, without regard to the way the manifold is embedded in the ambient space.

We should emphasise that manifolds fall under the setting of varying domains in our Geometric Deep Learning blueprint, and in this sense are similar to graphs. We will highlight the importance of the notion of invariance to domain deformations – what we called `geometric stability' in Section 3.3. Since differential geometry is perhaps less familiar to the machine learning audience, we will introduce the basic concepts required for our discussion and refer the reader to Penrose (2005) for a detailed exposition.

Riemannian manifolds

Since the formal definition of a manifold\footnote{By `smooth' we mean differentiable a sufficient number of times, which is tacitly assumed for convenience. `Deformed' here means diffeomorphic, i.e., we can map between the two neighbourhoods using a smooth and invertible map with a smooth inverse.} is somewhat involved, we prefer to provide an intuitive picture at the expense of some precision. In this context, we can think of a (differentiable or smooth) manifold as a smooth multidimensional curved surface that is locally Euclidean, in the sense that any small neighbourhood around any point can be deformed to a neighbourhood of $\mathbb{R}^s$; in this case the manifold is said to be $s$-dimensional. This allows us to locally approximate the manifold around a point $u$ through the tangent space $T_u\Omega$. The latter can be visualised by thinking of a prototypical two-dimensional manifold, the sphere, and attaching a plane to it at a point: with sufficient zoom, the spherical surface will appear planar (Figure 11). The collection of all tangent spaces is called the tangent bundle, denoted $T\Omega$;\footnote{Formally, the tangent bundle is the disjoint union $T\Omega = \bigsqcup_{u\in\Omega} T_u\Omega$.} we will dwell on the concept of bundles in more detail in Section 4.5.

A tangent vector, which we denote by $X\in T_u\Omega$, can be thought of as a local displacement from the point $u$. In order to measure the lengths of tangent vectors and the angles between them,\footnote{A bilinear function $g$ is said to be positive-definite if $g(X,X)>0$ for any non-zero vector $X\neq 0$. If $g$ is expressed as a matrix $\mathbf{G}$, this means $\mathbf{G}\succ 0$. The determinant $|\mathbf{G}|^{1/2}$ provides a local volume element, which does not depend on the choice of basis.} we need to equip the tangent space with additional structure, expressed as a positive-definite bilinear function $g_u: T_u\Omega\times T_u\Omega\rightarrow\mathbb{R}$ depending smoothly on $u$. Such a function is called a Riemannian metric, in honour of Bernhard Riemann, who introduced the concept in 1856, and can be thought of as an inner product on the tangent space, $\langle X,Y\rangle_u = g_u(X,Y)$, expressing the angle between any two tangent vectors $X,Y\in T_u\Omega$. The metric also induces a norm $\|X\|_u = g_u^{1/2}(X,X)$, allowing us to locally measure the lengths of vectors.

We must stress that tangent vectors are abstract geometric entities that exist in their own right and are coordinate-free. If we are to express a tangent vector $X$ numerically as an array of numbers, we can only represent it as a list of coordinates $\mathbf{x}=(x_1,\ldots,x_s)$ relative to some local basis\footnote{Unfortunately, vectors are too often identified with their coordinates. To emphasise this important difference, we use $X$ to denote a tangent vector and $\mathbf{x}$ to denote its coordinates.} $\{X_1,\ldots,X_s\}\subseteq T_u\Omega$. Similarly, the metric can be expressed as an $s\times s$ matrix $\mathbf{G}$ with elements $g_{ij} = g_u(X_i,X_j)$ in that basis. We will return to this point in Section 4.5.

Figure 11: Basic notions of Riemannian geometry illustrated on the example of the two-dimensional sphere $\mathbb{S}^2 = \{\mathbf{u}\in\mathbb{R}^3 : \|\mathbf{u}\|=1\}$, realised as a subset (sub-manifold) of $\mathbb{R}^3$. The tangent space to the sphere is given as $T_{\mathbf{u}}\mathbb{S}^2 = \{\mathbf{x}\in\mathbb{R}^3 : \mathbf{x}^\top\mathbf{u} = 0\}$ and is a 2D plane – hence this is a 2-dimensional manifold. The Riemannian metric is simply the Euclidean inner product restricted to the tangent plane, $\langle\mathbf{x},\mathbf{y}\rangle_{\mathbf{u}} = \mathbf{x}^\top\mathbf{y}$ for any $\mathbf{x},\mathbf{y}\in T_{\mathbf{u}}\mathbb{S}^2$. The exponential map is given by $\exp_{\mathbf{u}}(\mathbf{x}) = \cos(\|\mathbf{x}\|)\,\mathbf{u} + \frac{\sin(\|\mathbf{x}\|)}{\|\mathbf{x}\|}\,\mathbf{x}$, for $\mathbf{x}\in T_{\mathbf{u}}\mathbb{S}^2$. Geodesics are arcs of great circles, of length $d(\mathbf{u},\mathbf{v}) = \cos^{-1}(\mathbf{u}^\top\mathbf{v})$.
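The closed-form expressions for the exponential map and the geodesic distance on the sphere are easy to check numerically. The following is a small sketch of our own (the particular point and tangent vector are arbitrary):

```python
import numpy as np

def exp_map(u, x):
    # exponential map on the sphere: exp_u(x) = cos(|x|) u + (sin(|x|)/|x|) x
    t = np.linalg.norm(x)
    return np.cos(t) * u + (np.sin(t) / t) * x

def geodesic_dist(u, v):
    # geodesic (great-circle) distance d(u, v) = arccos(uᵀv)
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

u = np.array([0.0, 0.0, 1.0])     # a point on the sphere (the north pole)
x = np.array([0.3, 0.4, 0.0])     # a tangent vector at u (x ⊥ u), |x| = 0.5

v = exp_map(u, x)
assert np.isclose(np.linalg.norm(v), 1.0)     # exp_u(x) stays on the sphere
assert np.isclose(geodesic_dist(u, v), 0.5)   # walking along x covers distance |x|
```

The second assertion illustrates the defining property of the exponential map: it follows the geodesic in the direction of $\mathbf{x}$ for a distance of exactly $\|\mathbf{x}\|$.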

A manifold equipped with a metric is called a Riemannian manifold, and properties that can be expressed entirely in terms of the metric are said to be intrinsic. This is a crucial notion for our discussion: according to our template, we will be seeking to construct functions acting on signals defined on $\Omega$ that are invariant to metric-preserving transformations called isometries, which deform the manifold without affecting its local structure.\footnote{This result is known as the Embedding Theorem, due to Nash (1956). The art of origami is a manifestation of different isometric embeddings of the planar surface in $\mathbb{R}^3$.} If such functions can be expressed in terms of intrinsic quantities, they are automatically guaranteed to be isometry-invariant and thus unaffected by isometric deformations. These results can be further extended to approximate isometries; this is thus an instance of the geometric stability (domain deformation) discussed in our blueprint.

While, as we noted, the definition of a Riemannian manifold does not require a geometric realisation in any ambient space, it turns out that any smooth Riemannian manifold can be realised as a subset of a Euclidean space of sufficiently high dimension (in which case it is said to be `embedded' in that space), using the structure of the Euclidean space to induce the Riemannian metric. Such an embedding is, however, not necessarily unique – as we will see, two different isometric realisations of the same Riemannian metric are possible.

Scalar and vector fields

Since we are interested in signals defined on $\Omega$, we need the proper notions of scalar- and vector-valued functions on manifolds. A (smooth) scalar field is a function of the form $x:\Omega\rightarrow\mathbb{R}$. Scalar fields form a vector space $\mathcal{X}(\Omega,\mathbb{R})$ that can be equipped with the inner product

$$\langle x,y\rangle = \int_\Omega x(u)\,y(u)\,\mathrm{d}u, \tag{15}$$

where $\mathrm{d}u$ is the volume element induced by the Riemannian metric. A (smooth) tangent vector field is a function of the form $X:\Omega\rightarrow T\Omega$ assigning to each point a tangent vector in the respective tangent space, $u\mapsto X(u)\in T_u\Omega$. Vector fields\footnote{The fields are typically assumed to be of the same regularity class (smoothness) as the manifold itself.} also form a vector space $\mathcal{X}(\Omega, T\Omega)$, with the inner product defined through the Riemannian metric,

$$\langle X,Y\rangle = \int_\Omega g_u(X(u),Y(u))\,\mathrm{d}u. \tag{16}$$

Intrinsic gradient

Another way to think of (and actually define) vector fields is as a generalised notion of derivative. In classical calculus, one can locally linearise a (smooth) function through its differential $\mathrm{d}x(u) = x(u+\mathrm{d}u) - x(u)$, which gives the change of the value of the function $x$ at the point $u$ as a result of an infinitesimal displacement $\mathrm{d}u$. In our case, however, the naïve use of this definition is impossible, since expressions of the form `$u + \mathrm{d}u$' are meaningless on manifolds due to the lack of a global vector space structure.

The solution is to use tangent vectors as a model of local infinitesimal displacement. Given a smooth scalar field $x\in\mathcal{X}(\Omega,\mathbb{R})$, we can think of a (smooth) vector field as a linear map $Y:\mathcal{X}(\Omega,\mathbb{R})\rightarrow\mathcal{X}(\Omega,\mathbb{R})$ satisfying the properties of a derivation: $Y(c)=0$ for any constant $c$ (corresponding to the intuition that constant functions have vanishing derivatives), $Y(x+z) = Y(x)+Y(z)$ (linearity), and $Y(xz) = Y(x)z + xY(z)$ (the product or Leibniz rule), for any smooth scalar fields $x,z\in\mathcal{X}(\Omega,\mathbb{R})$. It can be shown that one can use these properties to define vector fields axiomatically. The differential $\mathrm{d}x(Y) = Y(x)$ can then be viewed as an operator $(u,Y)\mapsto Y(x)$ and interpreted as follows: the change of $x$ as the result of a displacement $Y\in T_u\Omega$ at the point $u$ is given by $\mathrm{d}_u x(Y)$.\footnote{Importantly, this construction does not use the Riemannian metric whatsoever, and can thus be extended to the more general construction of bundles discussed in Section 4.5.} It is thus an extension of the classical notion of directional derivative.

Alternatively, at each point $u$ the differential can be regarded as a linear functional $\mathrm{d}x_u: T_u\Omega\rightarrow\mathbb{R}$ acting on tangent vectors $X\in T_u\Omega$. Linear functionals on a vector space are called dual vectors or covectors; if in addition we are given an inner product (Riemannian metric), a dual vector can always be represented as

$$\mathrm{d}x_u(X) = g_u(\nabla x(u), X).$$

The representation of the differential at the point $u$ is a tangent vector $\nabla x(u)\in T_u\Omega$ called the (intrinsic) gradient of $x$; similarly to the gradient in classical calculus, it can be thought of as the direction of the steepest increase of $x$. The gradient, considered as an operator $\nabla: \mathcal{X}(\Omega,\mathbb{R})\rightarrow\mathcal{X}(\Omega,T\Omega)$, assigns at each point $x(u)\mapsto\nabla x(u)\in T_u\Omega$; the gradient of a scalar field $x$ is thus a vector field $\nabla x$.

Geodesics

Now consider a smooth curve $\gamma:[0,T]\rightarrow\Omega$ on the manifold, with endpoints $u=\gamma(0)$ and $v=\gamma(T)$. The derivative of the curve at the point $t$ is a tangent vector $\gamma'(t)\in T_{\gamma(t)}\Omega$ called the velocity vector.\footnote{It is tacitly assumed that curves are given in arclength parametrisation, such that $\|\gamma'\|=1$ (unit speed).} Among all the curves connecting the points $u$ and $v$, we are interested in those of minimum length, i.e., we seek $\gamma$ minimising the length functional

$$\ell(\gamma) = \int_0^T \|\gamma'(t)\|_{\gamma(t)}\,\mathrm{d}t = \int_0^T g_{\gamma(t)}^{1/2}(\gamma'(t),\gamma'(t))\,\mathrm{d}t.$$

Such curves are called geodesics (from the Greek γεωδαισία, literally `division of the Earth'), and they play an important role in differential geometry. Crucially for our discussion, the way we have defined geodesics is intrinsic: they depend solely on the Riemannian metric (through the length functional).
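The length functional can be approximated by summing chord lengths along a discretised curve; on the sphere this recovers the great-circle distance and shows that detours are longer. This is a numerical sketch of our own (the particular detour curve is an arbitrary choice):

```python
import numpy as np

def curve_length(gamma):
    # discretised length functional: sum of chord lengths between consecutive samples
    return np.sum(np.linalg.norm(np.diff(gamma, axis=0), axis=1))

u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
t = np.linspace(0.0, np.pi / 2, 2000)[:, None]

# the great-circle arc from u to v, a geodesic of length π/2
great_arc = np.cos(t) * u + np.sin(t) * v
assert np.isclose(curve_length(great_arc), np.pi / 2)

# a detour on the sphere with the same endpoints (the perturbation
# vanishes at t = 0 and t = π/2) is strictly longer
detour = great_arc + 0.2 * np.sin(2 * t) * np.array([0.0, 0.0, 1.0])
detour /= np.linalg.norm(detour, axis=1, keepdims=True)
assert curve_length(detour) > curve_length(great_arc)
```

Note that only distances measured along the surface enter this computation, in keeping with the intrinsic nature of geodesics.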

Readers familiar with differential geometry might recall that geodesics are a more general concept, whose definition in fact does not require a Riemannian metric but only a connection (also called a covariant derivative, as it generalises the notion of derivative to vector and tensor fields), which is defined axiomatically, similarly to our construction of the differential. Given a Riemannian metric, there exists a unique special connection called the Levi-Civita connection,\footnote{The Levi-Civita connection is torsion-free and compatible with the metric. The Fundamental Theorem of Riemannian Geometry guarantees its existence and uniqueness.} which is often tacitly assumed in Riemannian geometry. Geodesics arising from this connection are exactly the length-minimising curves we defined above.

We will show next how to use geodesics to define a way to transport tangent vectors on the manifold (parallel transport), create local intrinsic maps from the manifold to the tangent space (exponential map), and define distances (geodesic metric). This will allow us to construct convolution-like operations by applying a filter locally in the tangent space.

Parallel transport

\footnote{Euclidean transport of a vector from A to C makes no sense on the sphere, as the resulting vectors (red) are not in the tangent plane. Parallel transport from A to C (blue) rotates the vector along the path. It is path-dependent: going along the paths AC and ABC produces different results.}

One issue we have already encountered when dealing with manifolds is that we cannot directly add or subtract two points $u,v\in\Omega$. The same problem arises when trying to compare tangent vectors at different points: though they have the same dimension, they belong to different spaces, e.g. $X\in T_u\Omega$ and $Y\in T_v\Omega$, and are thus not directly comparable. Geodesics provide a mechanism to move vectors from one point to another, in the following way: let $\gamma$ be a geodesic connecting the points $u=\gamma(0)$ and $v=\gamma(T)$, and let $X\in T_u\Omega$. We can define a new set of tangent vectors along the geodesic, $X(t)\in T_{\gamma(t)}\Omega$, such that the length of $X(t)$ and the angle (expressed through the Riemannian metric) between it and the velocity vector of the curve are constant,

$$g_{\gamma(t)}(X(t),\gamma'(t)) = g_u(X,\gamma'(0)) = \mathrm{const}, \qquad \|X(t)\|_{\gamma(t)} = \|X\|_u = \mathrm{const}.$$

As a result, we get a unique vector $X(T)\in T_v\Omega$ at the end point $v$.

The map $\Gamma_{u\rightarrow v}:T_u\Omega\rightarrow T_v\Omega$, defined as $\Gamma_{u\rightarrow v}(X)=X(T)$ using the above notation, is called parallel transport or connection, the latter term implying it is a mechanism to 'connect' between the tangent spaces $T_u\Omega$ and $T_v\Omega$. Due to the angle and length preservation conditions, parallel transport amounts to only a rotation of the vector, so it can be associated with an element of the special orthogonal group $\mathrm{SO}(s)$, called the structure group of the tangent bundle (assuming the manifold is orientable; otherwise $\mathrm{O}(s)$), which we will denote by $\mathfrak{g}_{u\rightarrow v}$ and discuss in further detail in Section 4.5.

As we mentioned before, a connection can be defined axiomatically, independently of the Riemannian metric, thus providing an abstract notion of parallel transport along any smooth curve. The result of such transport, however, depends on the path taken.
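The path dependence of parallel transport can be made concrete on the unit sphere, where transport along a minimising great-circle arc has a standard closed-form expression. The following numerical sketch (our own illustration; the function names are ours) transports the same tangent vector from A to C directly and via B, obtaining two different results:

```python
import numpy as np

def transport(p, q, X):
    """Parallel transport of a tangent vector X from p to q on the unit
    sphere, along the minimising great-circle arc (assumes q != -p)."""
    return X - (np.dot(q, X) / (1.0 + np.dot(p, q))) * (p + q)

A, B, C = np.eye(3)            # three orthogonal points on the sphere
X = np.array([0.0, 1.0, 0.0])  # a tangent vector at A (orthogonal to A)

direct = transport(A, C, X)                   # path A -> C
via_B = transport(B, C, transport(A, B, X))   # path A -> B -> C
```

Both results are valid unit tangent vectors at C (orthogonal to C), but they differ, illustrating that the way we move a vector around matters.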

Exponential map

Locally around a point $u$, it is always possible to define a unique geodesic $\gamma_X$ in a given direction $X\in T_u\Omega$, i.e. such that $\gamma_X(0)=u$ and $\gamma_X'(0)=X$. When $\gamma_X(t)$ is defined for all $t\geq 0$ (that is, we can shoot the geodesic from a point $u$ for as long as we like), the manifold is said to be geodesically complete and the exponential map is defined on the whole tangent space. Since compact manifolds are geodesically complete, we can tacitly assume this convenient property.

This definition of a geodesic provided a point and a direction gives a natural mapping from (a subset of) the tangent space $T_u\Omega$ to $\Omega$ called the exponential map, $\exp_u : B_r(0)\subset T_u\Omega \rightarrow \Omega$, defined by taking a unit step along the geodesic in the direction $X$, i.e., $\exp_u(X)=\gamma_X(1)$. The exponential map $\exp_u$ is a local diffeomorphism, as it deforms the neighbourhood $B_r(0)$ (a ball of radius $r$) of the origin of $T_u\Omega$ into a neighbourhood of $u$. Note that geodesic completeness does not necessarily guarantee that $\exp_u$ is a global diffeomorphism: the largest radius $r$ about $u$ for which $\exp_u$ maps $B_r(0)\subseteq T_u\Omega$ diffeomorphically is called the injectivity radius. Conversely, one can also regard the exponential map as an intrinsic local deformation ('flattening') of the manifold into the tangent space.
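On the unit sphere, the exponential map has the closed form $\exp_u(X) = \cos(\|X\|)\,u + \sin(\|X\|)\,X/\|X\|$: shooting from the north pole with a tangent vector of length $\pi/2$ traverses a quarter of a great circle and lands on the equator. A minimal sketch (function names are ours):

```python
import numpy as np

def exp_map(u, X):
    """Exponential map on the unit sphere: follow the great circle leaving
    u with initial velocity X (a vector in the tangent plane at u)."""
    t = np.linalg.norm(X)
    if t < 1e-12:
        return u
    return np.cos(t) * u + np.sin(t) * (X / t)

u = np.array([0.0, 0.0, 1.0])                    # north pole
v = exp_map(u, np.array([np.pi / 2, 0.0, 0.0]))  # lands on the equator

# Within the injectivity radius, exp_u preserves radial distances:
# d(u, exp_u(X)) = ||X|| for ||X|| <= pi
w = exp_map(u, np.array([0.7, 0.0, 0.0]))
d = np.arccos(np.clip(np.dot(u, w), -1.0, 1.0))
```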

Geodesic distances

A result known as the Hopf–Rinow Theorem guarantees that geodesically complete manifolds are also complete metric spaces (the theorem thus establishes the equivalence between geodesic and metric completeness, the latter meaning that every Cauchy sequence converges in the geodesic distance metric), in which one can realise a distance (called the geodesic distance or metric) between any pair of points $u,v$ as the length of the shortest path between them

$$d_g(u,v) = \min_{\gamma}\,\ell(\gamma) \qquad \text{s.t.} \qquad \gamma(0)=u, \;\; \gamma(T)=v,$$

which exists (i.e., the minimum is attained). Note that the term 'metric' is used in two senses: the Riemannian metric $g$ and the distance $d$. To avoid confusion, we will use the term 'distance' when referring to the latter. Our notation $d_g$ emphasises that the distance depends on the Riemannian metric $g$, through the definition of the geodesic length $\ell$.
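On the unit sphere, the minimising paths are great-circle arcs, so the geodesic distance reduces to the arc length $d_g(u,v) = \arccos(\langle u,v\rangle)$. A minimal sketch (function name is ours):

```python
import numpy as np

def geodesic_distance(u, v):
    """Geodesic distance on the unit sphere: length of the shorter
    great-circle arc between u and v."""
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
d_quarter = geodesic_distance(e1, e2)     # orthogonal points: quarter circle
d_antipodal = geodesic_distance(e1, -e1)  # antipodal points: half circle
```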

Isometries

Consider now a deformation of our manifold $\Omega$ into another manifold $\tilde{\Omega}$ with a Riemannian metric $h$, which we assume to be a diffeomorphism $\eta:(\Omega,g)\rightarrow(\tilde{\Omega},h)$ between the manifolds. Its differential $\mathrm{d}\eta : T\Omega\rightarrow T\tilde{\Omega}$ defines a map between the respective tangent bundles (referred to as the pushforward), such that at a point $u$ we have $\mathrm{d}\eta_u : T_u\Omega\rightarrow T_{\eta(u)}\tilde{\Omega}$, interpreted as before: if we make a small displacement from point $u$ by tangent vector $X\in T_u\Omega$, the map $\eta$ will be displaced from point $\eta(u)$ by tangent vector $\mathrm{d}\eta_u(X)\in T_{\eta(u)}\tilde{\Omega}$.

Since the pushforward provides a mechanism to associate tangent vectors on the two manifolds (the pushforward $\eta_*$ and pullback $\eta^*$ are adjoint operators, $\langle\eta^*\alpha, X\rangle = \langle\alpha, \eta_* X\rangle$, where $\alpha\in T^*\Omega$ is a dual vector field, defined at each point as a linear functional acting on $T_u\Omega$, and the inner products are defined respectively on vector and dual vector fields), it allows us to pull back the metric $h$ from $\tilde{\Omega}$ to $\Omega$,

$$(\eta^* h)_u(X,Y) = h_{\eta(u)}(\mathrm{d}\eta_u(X), \mathrm{d}\eta_u(Y)).$$

If the pullback metric coincides at every point with that of $\Omega$, i.e., $g=\eta^* h$, the map $\eta$ is called a (Riemannian) isometry. For two-dimensional manifolds (surfaces), isometries can be intuitively understood as inelastic deformations that deform the manifold without 'stretching' or 'tearing' it.

By virtue of their definition, isometries preserve intrinsic structures such as geodesic distances, which are expressed entirely in terms of the Riemannian metric. Therefore, we can also understand isometries from the position of metric geometry, as distance-preserving maps ('metric isometries') between metric spaces, $\eta:(\Omega,d_g)\rightarrow(\tilde{\Omega},d_h)$, in the sense that

$$d_g(u,v) = d_h(\eta(u),\eta(v))$$

for all $u,v\in\Omega$, or more compactly, $d_g = d_h\circ(\eta\times\eta)$. In other words, Riemannian isometries are also metric isometries. On connected manifolds, the converse is also true: every metric isometry is also a Riemannian isometry (this result is known as the Myers–Steenrod Theorem; we tacitly assume our manifolds to be connected).

In our Geometric Deep Learning blueprint, $\eta$ is a model of domain deformations. When $\eta$ is an isometry, any intrinsic quantities are unaffected by such deformations. One can generalise exact (metric) isometries through the notions of metric dilation

$$\mathrm{dil}(\eta) = \sup_{u\neq v\in\Omega}\frac{d_h(\eta(u),\eta(v))}{d_g(u,v)}$$

or metric distortion

$$\mathrm{dis}(\eta) = \sup_{u,v\in\Omega}\left|d_h(\eta(u),\eta(v)) - d_g(u,v)\right|,$$

which capture the relative and absolute change of the geodesic distances under $\eta$, respectively. The condition (5) for the stability of a function $f\in\mathcal{F}(\mathcal{X}(\Omega))$ under domain deformation can be rewritten in this case as

$$\|f(x,\Omega) - f(x\circ\eta^{-1},\tilde{\Omega})\| \leq C\,\|x\|\,\mathrm{dis}(\eta).$$
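In practice, the dilation and distortion above can be estimated on a finite sample of points from the two pairwise geodesic-distance matrices. A small illustrative sketch (the function and variable names are ours):

```python
import numpy as np

def dilation_and_distortion(D_g, D_h):
    """Estimate dil(eta) and dis(eta) from pairwise distance matrices:
    D_g[i, j] = d_g(u_i, u_j) and D_h[i, j] = d_h(eta(u_i), eta(u_j))."""
    off = ~np.eye(len(D_g), dtype=bool)   # exclude u = v for the ratio
    dil = np.max(D_h[off] / D_g[off])
    dis = np.max(np.abs(D_h - D_g))
    return dil, dis

# Toy example: points on a line, with eta scaling all distances by 2
u = np.linspace(0.0, 1.0, 5)
D_g = np.abs(u[:, None] - u[None, :])
D_h = 2.0 * D_g
dil, dis = dilation_and_distortion(D_g, D_h)
```

Here the dilation is the scaling factor 2, while the distortion is the largest absolute discrepancy, attained by the farthest pair of points.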

Intrinsic symmetries

A particular case of the above is a diffeomorphism of the domain itself (what we termed an automorphism in Section 3.2), which we will denote by $\tau\in\mathrm{Diff}(\Omega)$. We will call it a Riemannian (self-)isometry if the pullback metric satisfies $\tau^* g = g$, or a metric (self-)isometry if $d_g = d_g\circ(\tau\times\tau)$. Not surprisingly, isometries form a group under composition, denoted by $\mathrm{Iso}(\Omega)$ and called the isometry group: the identity element is the map $\tau(u)=u$, and the inverse always exists (by the definition of $\tau$ as a diffeomorphism). Continuous symmetries on manifolds are infinitesimally generated by special tangent vector fields called Killing fields, named after Wilhelm Killing. Self-isometries are thus intrinsic symmetries of manifolds.

Fourier Analysis on Manifolds

We will now show how to construct intrinsic convolution-like operations on manifolds, which, by construction, will be invariant to isometric deformations. For this purpose, we have two options: One is to use an analogy of the Fourier transform, and define the convolution as a product in the Fourier domain. The other is to define the convolution spatially, by correlating a filter locally with the signal. Let us discuss the spectral approach first.

Recall that in the Euclidean domain the Fourier transform arises as the eigenbasis of circulant matrices, which are jointly diagonalisable due to their commutativity. Thus, any circulant matrix (and in particular, a differential operator) can be used to define an analogy of the Fourier transform on general domains. In Riemannian geometry, it is common to use the orthogonal eigenbasis of the Laplacian operator, which we will now define.
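The Euclidean statement can be verified numerically: the columns of the DFT matrix are eigenvectors of any circulant matrix, with eigenvalues given by the DFT of its first column. A short sketch, using a periodic second-difference stencil (a discrete 1D Laplacian) as the circulant operator:

```python
import numpy as np

n = 8
c = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0])  # periodic 2nd-difference stencil
C = np.stack([np.roll(c, k) for k in range(n)], axis=1)    # circulant: C[i, j] = c[(i-j) % n]

# DFT atoms: F[j, k] = exp(2*pi*i*j*k/n) / sqrt(n)
F = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)
lam = np.fft.fft(c)                    # eigenvalues: DFT of the first column

# Each column of F is an eigenvector of C: C F[:, k] = lam[k] F[:, k]
err = np.max(np.abs(C @ F - F * lam))
```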

For this purpose, recall our definition of the intrinsic gradient operator $\nabla:\mathcal{X}(\Omega,\mathbb{R})\rightarrow\mathcal{X}(\Omega,T\Omega)$, producing a tangent vector field that indicates the local direction of steepest increase of a scalar field on the manifold. In a similar manner, we can define the divergence operator $\nabla^*:\mathcal{X}(\Omega,T\Omega)\rightarrow\mathcal{X}(\Omega,\mathbb{R})$. If we think of a tangent vector field as a flow on the manifold, the divergence measures the net flow of the field at a point, allowing us to distinguish between field 'sources' and 'sinks'. We use the notation $\nabla^*$ (as opposed to the common $\mathrm{div}$) to emphasise that the two operators are adjoint,

$$\langle X, \nabla x\rangle = \langle\nabla^* X, x\rangle,$$

where we use the inner products (15) and (16) between scalar and vector fields.

The Laplacian (also known as the Laplace–Beltrami operator in differential geometry) is an operator on $\mathcal{X}(\Omega)$ defined as $\Delta = \nabla^*\nabla$, which can be interpreted as the difference between the average of a function on an infinitesimal sphere around a point and the value of the function at the point itself. From this interpretation it is also clear that the Laplacian is isotropic; we will see in Section 4.6 that it is possible to define anisotropic Laplacians (see Andreux et al. (2014); Boscaini et al. (2016b)) of the form $\nabla^*(A(u)\nabla)$, where $A(u)$ is a position-dependent tensor determining a local direction. The Laplacian is one of the most important operators in mathematical physics, used to describe phenomena as diverse as heat diffusion, quantum oscillations, and wave propagation. Importantly in our context, the Laplacian is intrinsic, and thus invariant under isometries of $\Omega$.

It is easy to see that the Laplacian is self-adjoint (‘symmetric’),

$$\langle\nabla x, \nabla x\rangle = \langle x, \Delta x\rangle = \langle\Delta x, x\rangle.$$

The quadratic form on the left in the above expression is actually the already familiar Dirichlet energy,

$$c^2(x) = \|\nabla x\|^2 = \langle\nabla x, \nabla x\rangle = \int_\Omega \|\nabla x(u)\|_u^2\,\mathrm{d}u = \int_\Omega g_u(\nabla x(u), \nabla x(u))\,\mathrm{d}u$$

measuring the smoothness of $x$.
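A discrete analogue of the adjointness relation, and hence of the Dirichlet energy, is easily constructed on a graph, where the gradient is realised by the edge-node incidence matrix $\mathbf{G}$ and the Laplacian is $\mathbf{G}^\top\mathbf{G}$. A minimal sketch (all names are ours):

```python
import numpy as np

# Gradient on a graph: maps node signals to edge signals, (G x)_(i,j) = x_j - x_i;
# its adjoint G^T plays the role of the divergence, and Delta = G^T G is the Laplacian.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
n = 4
G = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    G[e, i], G[e, j] = -1.0, 1.0

L = G.T @ G                        # Delta = grad* grad

rng = np.random.default_rng(1)
x = rng.normal(size=n)
dirichlet = np.dot(G @ x, G @ x)   # ||grad x||^2 = <grad x, grad x>
```

The Dirichlet energy equals $\langle x, \Delta x\rangle$, the Laplacian is symmetric, and constant signals (the smoothest possible) have zero energy.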

The Laplacian operator admits an eigendecomposition

$$\Delta\varphi_k = \lambda_k\varphi_k, \qquad k = 0, 1, \ldots$$

with a countable spectrum if the manifold is compact (which we tacitly assume), and orthogonal eigenfunctions, $\langle\varphi_k, \varphi_l\rangle = \delta_{kl}$, due to the self-adjointness of $\Delta$. The Laplacian eigenbasis can also be constructed as a set of orthogonal minimisers of the Dirichlet energy,

$$\varphi_{k+1} = \arg\min_\varphi \|\nabla\varphi\|^2 \qquad \text{s.t.} \qquad \|\varphi\| = 1 \;\;\text{and}\;\; \langle\varphi, \varphi_j\rangle = 0$$

for $j = 0, \ldots, k$, allowing us to interpret it as the smoothest orthogonal basis on $\Omega$. The eigenfunctions $\varphi_0, \varphi_1, \ldots$ and the corresponding eigenvalues $0 = \lambda_0 \leq \lambda_1 \leq \ldots$ can be interpreted as the analogy of the atoms and frequencies in the classical Fourier transform: indeed, $e^{\mathrm{i}\xi u}$ are the eigenfunctions of the Euclidean Laplacian $\tfrac{\mathrm{d}^2}{\mathrm{d}u^2}$.

This orthogonal basis allows us to expand square-integrable functions on $\Omega$ into a Fourier series

$$x(u) = \sum_{k\geq 0}\langle x, \varphi_k\rangle\,\varphi_k(u)$$

where $\hat{x}_k = \langle x, \varphi_k\rangle$ are referred to as the Fourier coefficients or the (generalised) Fourier transform of $x$ (note that this Fourier transform has a discrete index, since the spectrum is discrete due to the compactness of $\Omega$). Truncating the Fourier series results in an approximation error that can be bounded (Aflalo and Kimmel, 2013) by

$$\left\| x - \sum_{k=0}^{N}\langle x, \varphi_k\rangle\varphi_k\right\|^2 \leq \frac{\|\nabla x\|^2}{\lambda_{N+1}}.$$

Aflalo et al. (2015) further showed that no other basis attains a better error, making the Laplacian eigenbasis optimal for representing smooth signals on manifolds.
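The truncation bound can be checked numerically in the discrete setting, using a path-graph Laplacian as a stand-in for $\Delta$ (a sketch under that assumption; variable names are ours):

```python
import numpy as np

# Path-graph Laplacian as a discrete stand-in for Delta
n = 32
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0
lam, Phi = np.linalg.eigh(L)       # ascending eigenvalues, orthonormal eigenvectors

rng = np.random.default_rng(2)
x = rng.normal(size=n)
x_hat = Phi.T @ x                  # (generalised) Fourier coefficients of x

N = 8
err = np.sum(x_hat[N + 1:] ** 2)   # squared norm of the truncation error
bound = (x @ L @ x) / lam[N + 1]   # Dirichlet energy / lambda_{N+1}
```

The error is $\sum_{k>N}\hat{x}_k^2$, which is bounded by the Dirichlet energy $\sum_k \lambda_k\hat{x}_k^2$ divided by $\lambda_{N+1}$, exactly as in the continuous bound.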

Spectral Convolution on Manifolds

Spectral convolution can be defined as the product of the Fourier transforms of the signal $x$ and the filter $\theta$,

$$(x\star\theta)(u) = \sum_{k\geq 0}(\hat{x}_k\cdot\hat{\theta}_k)\,\varphi_k(u). \tag{17}$$

Note that here we use what is a property of the classical Fourier transform (the Convolution Theorem) as a way to define a non-Euclidean convolution. By virtue of its construction, the spectral convolution is intrinsic and thus isometry-invariant. Furthermore, since the Laplacian operator is isotropic, it has no sense of direction; in this sense, the situation is similar to the one we had on graphs in Section 4.1, due to the permutation invariance of neighbour aggregation.

Figure 12: Instability of spectral filters under domain perturbation. Left: a signal $x$ on the mesh $\Omega$. Middle: the result of spectral filtering in the eigenbasis of the Laplacian $\Delta$ on $\Omega$. Right: the same spectral filter applied to the eigenvectors of the Laplacian $\tilde{\Delta}$ of a nearly-isometrically perturbed domain $\tilde{\Omega}$ produces a very different result.

In practice, a direct computation of (17) is prohibitively expensive due to the need to diagonalise the Laplacian. Even worse, it turns out to be geometrically unstable: the higher-frequency eigenfunctions of the Laplacian can change dramatically as a result of even small near-isometric perturbations of the domain $\Omega$ (see Figure 12). A more stable solution is provided by realising the filter as a spectral transfer function of the form $\hat{p}(\Delta)$,

\begin{align}
(\hat{p}(\Delta)x)(u) &= \sum_{k\geq 0}\hat{p}(\lambda_k)\langle x, \varphi_k\rangle\,\varphi_k(u) \tag{18}\\
&= \int_\Omega x(v)\sum_{k\geq 0}\hat{p}(\lambda_k)\varphi_k(v)\varphi_k(u)\,\mathrm{d}v \tag{19}
\end{align}

which can be interpreted in two manners: either as a spectral filter (18), where we identify $\hat{\theta}_k = \hat{p}(\lambda_k)$, or as a spatial filter (19) with a position-dependent kernel $\theta(u,v) = \sum_{k\geq 0}\hat{p}(\lambda_k)\varphi_k(v)\varphi_k(u)$. (Geometric Deep Learning methods based on spectral convolution expressed through the Fourier transform are often referred to as 'spectral', as opposed to the 'spatial' methods we have seen before in the context of graphs; we see here that the two views may be equivalent, so this dichotomy is somewhat artificial and not entirely appropriate.) The advantage of this formulation is that $\hat{p}(\lambda)$ can be parametrised by a small number of coefficients, and choosing parametric functions such as polynomials $\hat{p}(\lambda) = \sum_{l=0}^{r}\alpha_l\lambda^l$ allows the filter to be computed efficiently as

$$(\hat{p}(\Delta)x)(u) = \sum_{k\geq 0}\sum_{l=0}^{r}\alpha_l\lambda_k^l\,\langle x, \varphi_k\rangle\,\varphi_k(u) = \sum_{l=0}^{r}\alpha_l(\Delta^l x)(u),$$

avoiding the spectral decomposition altogether. We will discuss this construction in further detail in Section 4.6.
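The equivalence between the spectral definition of the filter and the polynomial-of-the-Laplacian computation can be illustrated on a discrete Laplacian (again a graph stand-in; all names are ours):

```python
import numpy as np

# A discrete Laplacian (path graph) standing in for Delta
n = 16
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0
alpha = [0.5, -0.3, 0.1]                 # polynomial coefficients, degree r = 2

rng = np.random.default_rng(3)
x = rng.normal(size=n)

# Spectral route: filter the Fourier coefficients by p(lambda_k)
lam, Phi = np.linalg.eigh(L)
p_lam = sum(a * lam ** l for l, a in enumerate(alpha))
y_spectral = Phi @ (p_lam * (Phi.T @ x))

# Spatial route: repeated application of the Laplacian, no eigendecomposition
y_spatial, Lx = np.zeros(n), x.copy()
for a in alpha:
    y_spatial += a * Lx                  # accumulate alpha_l * (L^l x)
    Lx = L @ Lx
```

The spatial route requires only $r$ applications of the (typically sparse) Laplacian, which is the key to its efficiency.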

Spatial Convolution on Manifolds

A second alternative is to define convolution on manifolds by matching a filter at different points, as we did in formula (14),

$$(x\star\theta)(u) = \int_{T_u\Omega} x(\exp_u Y)\,\theta_u(Y)\,\mathrm{d}Y, \tag{20}$$

where we now have to use the exponential map to access the values of the scalar field $x$ from the tangent space, and the filter $\theta_u$ is defined in the tangent space at each point, and is hence position-dependent. If one defines the filter intrinsically, such a convolution would be isometry-invariant, a property we mentioned as crucial in many computer vision and graphics applications.

We need, however, to note several substantial differences from our previous construction in Sections 4.2–4.3. First, because a manifold is generally not a homogeneous space, we no longer have a global group structure allowing us to have a shared filter (i.e., the same $\theta$ at every $u$, rather than $\theta_u$ in expression (20)) defined at one point and then moved around. An analogy of this operation on the manifold would require parallel transport, allowing a shared $\theta$, defined as a function on $T_u\Omega$, to be applied at some other $T_v\Omega$. Second, as we have seen, such transport in general depends on the path between $u$ and $v$, so the way we move the filter around matters. Third, since we can use the exponential map only locally, the filter must be local, with support bounded by the injectivity radius. Fourth, and most crucially, we cannot work with $\theta(X)$ directly, as $X$ is an abstract geometric object: in order for it to be used in computations, we must represent it relative to some local basis $\omega_u:\mathbb{R}^s\rightarrow T_u\Omega$, as an $s$-dimensional array of coordinates $\mathbf{x} = \omega_u^{-1}(X)$. This allows us to rewrite the convolution (20) as

(xθ)(u)=[0,1]sx(expu(ωu𝐲))θ(𝐲)d𝐲,𝑥𝜃𝑢subscriptsuperscript01𝑠𝑥subscript𝑢subscript𝜔𝑢𝐲𝜃𝐲differential-d𝐲(x\star\theta)(u)=\int_{[0,1]^{s}}x(\exp_{u}(\omega_{u}\mathbf{y}))\theta(\mathbf{y})\mathrm{d}\mathbf{y},( italic_x ⋆ italic_θ ) ( italic_u ) = ∫ start_POSTSUBSCRIPT [ 0 , 1 ] start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_x ( roman_exp start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_ω start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT bold_y ) ) italic_θ ( bold_y ) roman_d bold_y , (21)

with the filter defined on the unit cube. Since the exponential map is intrinsic (being defined through geodesics), the resulting convolution is isometry-invariant.
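As a concrete, heavily simplified illustration of convolution (21), the circle $\mathbb{S}^1$ is a manifold where everything can be made explicit: it is parallelisable, the gauge $\omega_u$ can be taken as the identity, and the exponential map is just angle addition. The following NumPy sketch (grid resolution and filter are arbitrary illustrative choices) checks that the construction commutes with the isometries, i.e. the rotations, of the circle:

```python
import numpy as np

# A minimal sketch of geodesic convolution (21) on S^1, where
# exp_u(t) = u + t (angles mod 2*pi) and the gauge is the identity.
n = 360                                  # grid resolution (assumption)
grid = 2 * np.pi * np.arange(n) / n      # sample points u on S^1

def conv(x, theta):
    """(x * theta)(u) = sum_y theta(y) x(exp_u(y)), y in the tangent space."""
    offsets = np.arange(-(len(theta) // 2), len(theta) // 2 + 1)
    out = np.zeros(n)
    for k, y in enumerate(offsets):
        out += theta[k] * np.roll(x, -y)  # picks out x(u + y * 2*pi/n)
    return out

x = np.cos(grid)                         # a scalar field on the circle
theta = np.array([0.25, 0.5, 0.25])      # a local filter (assumption)

# Rotating the field and then convolving equals convolving and then
# rotating the output: the construction respects the isometries of S^1.
shift = 45                               # rotation by 45 grid steps
lhs = conv(np.roll(x, shift), theta)
rhs = np.roll(conv(x, theta), shift)
print(np.allclose(lhs, rhs))             # True
```

The same check would fail if the filter were allowed to depend on $u$ in a way not tied to the geometry, which is precisely the point of defining it intrinsically.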

Yet, this tacitly assumed that we can carry the frame $\omega_u$ along to another manifold, i.e. $\omega_u' = \mathrm{d}\eta_u \circ \omega_u$. Obtaining such a frame (or gauge, in physics terminology) in a consistent manner, given only the manifold $\Omega$, is however fraught with difficulty. First, a smooth global gauge may not exist: this is the situation on manifolds that are not parallelisable (the sphere $\mathbb{S}^2$ is an example of a non-parallelisable manifold, a consequence of the Poincaré–Hopf Theorem, colloquially stated as 'one cannot comb a hairy ball without creating a cowlick'), on which one cannot define a smooth non-vanishing tangent vector field. Second, we do not have a canonical gauge on manifolds, so this choice is arbitrary: since our convolution depends on $\omega$, a different choice would yield different results.

We should note that this is a case where practice diverges from theory: in practice, it is possible to build frames that are mostly smooth, with a limited number of singularities, e.g. by taking the intrinsic gradient of some intrinsic scalar field on the manifold (the GFrames algorithm of Melzi et al. (2019) constructs stable gauges of this kind on nearly-isometric manifolds). Moreover, such constructions are stable: the frames constructed this way are identical on isometric manifolds and similar on approximately isometric ones. Such approaches were in fact employed in the early works on deep learning on manifolds (Masci et al., 2015; Monti et al., 2017).

Nevertheless, this solution is not entirely satisfactory because near singularities the filter orientation (being defined in a fixed manner relative to the gauge) varies wildly, leading to a non-smooth feature map even if the input signal and filter are smooth. Moreover, there is no clear reason why a given direction at some point $u$ should be considered equivalent to another direction at an altogether different point $v$. Thus, despite these practical alternatives, we will next look for a more theoretically well-founded approach that is altogether independent of the choice of gauge.

4.5 Gauges and Bundles

The notion of gauge, which we have defined as a frame for the tangent space, is quite a bit more general in physics: it can refer to a frame for any vector bundle, not just the tangent bundle. (Historically, fibre bundles arose first in the modern differential geometry of Élie Cartan, who however did not define them explicitly, and were then further developed as standalone objects in the field of topology in the 1930s.) Informally, a vector bundle describes a family of vector spaces parametrised by another space: it consists of a base space $\Omega$ with an identical vector space $\mathbb{V}$ (called the fibre) attached at each position $u \in \Omega$ (for the tangent bundle these are the tangent spaces $T_u\Omega$). Roughly speaking, a bundle looks like a product $\Omega \times \mathbb{V}$ locally around $u$, but globally might be 'twisted' and have an overall different structure. In Geometric Deep Learning, fibres can be used to model the feature spaces at each point of the manifold $\Omega$, with the dimension of the fibre equal to the number of feature channels. In this context, a new and fascinating kind of symmetry, called gauge symmetry, may present itself.

Let us consider again an $s$-dimensional manifold $\Omega$ with its tangent bundle $T\Omega$, and a vector field $X : \Omega \rightarrow T\Omega$ (which in this terminology is referred to as a section of the tangent bundle). Relative to a gauge $\omega$ for the tangent bundle, $X$ is represented as a function $\mathbf{x} : \Omega \rightarrow \mathbb{R}^s$. However, it is important to realise that what we are really interested in is the underlying geometric object (the vector field), whose representation as a function $\mathbf{x} \in \mathcal{X}(\Omega, \mathbb{R}^s)$ depends on the choice of gauge $\omega$. If we change the gauge, we also need to change $\mathbf{x}$ so as to preserve the underlying vector field being represented.

Tangent bundles and the structure group

When we change the gauge, we need to apply at each point an invertible matrix that maps the old gauge to the new one. This matrix is unique for every pair of gauges at each point, but possibly different at different points. In other words, a gauge transformation is a mapping $\mathfrak{g} : \Omega \rightarrow \operatorname{GL}(s)$, where $\operatorname{GL}(s)$ is the general linear group of invertible $s \times s$ matrices. It acts on the gauge $\omega_u : \mathbb{R}^s \rightarrow T_u\Omega$ to produce a new gauge $\omega_u' = \omega_u \circ \mathfrak{g}_u : \mathbb{R}^s \rightarrow T_u\Omega$. The gauge transformation acts on a coordinate vector field at each point via $\mathbf{x}'(u) = \mathfrak{g}_u^{-1}\mathbf{x}(u)$ to produce the coordinate representation $\mathbf{x}'$ of $X$ relative to the new gauge. The underlying vector field remains unchanged:

$$X(u) = \omega_u'(\mathbf{x}'(u)) = \omega_u(\mathfrak{g}_u\mathfrak{g}_u^{-1}\mathbf{x}(u)) = \omega_u(\mathbf{x}(u)) = X(u),$$

which is exactly the property we desired. More generally, we may have a field of geometric quantities that transform according to a representation $\rho$ of $\operatorname{GL}(s)$, e.g. a field of 2-tensors (matrices) $\mathbf{A}(u) \in \mathbb{R}^{s\times s}$ that transform like $\mathbf{A}'(u) = \rho_2(\mathfrak{g}_u^{-1})\mathbf{A}(u) = \rho_1(\mathfrak{g}_u)\mathbf{A}(u)\rho_1(\mathfrak{g}_u^{-1})$. In this case, the gauge transformation $\mathfrak{g}_u$ acts via $\rho(\mathfrak{g}_u)$.
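The invariance above is easy to verify numerically. In this minimal NumPy sketch (all matrices are random illustrative choices, not taken from the text), we change the gauge by some $\mathfrak{g}_u \in \operatorname{GL}(s)$, transform the coordinates accordingly, and check that the represented vector is unchanged:

```python
import numpy as np

# Check: changing gauge omega' = omega ∘ g while transforming
# coordinates x' = g^{-1} x leaves the vector X = omega(x) unchanged.
rng = np.random.default_rng(0)
s = 3
omega = rng.normal(size=(s, s))      # a gauge (frame) at a point u (assumption)
g = rng.normal(size=(s, s))          # a gauge transformation in GL(s); random
                                     # matrices are invertible almost surely
x = rng.normal(size=s)               # coordinates of X relative to omega

omega_new = omega @ g                # new gauge  omega'_u = omega_u ∘ g_u
x_new = np.linalg.inv(g) @ x         # new coords x'(u) = g_u^{-1} x(u)

X_old = omega @ x                    # X(u) in the old gauge
X_new = omega_new @ x_new            # X(u) in the new gauge
print(np.allclose(X_old, X_new))     # True
```

The two opposite transformations (the gauge by $\mathfrak{g}_u$, the coordinates by $\mathfrak{g}_u^{-1}$) cancel exactly, mirroring the displayed identity.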

Sometimes we may wish to restrict attention to frames with a certain property, such as orthogonal frames, right-handed frames, etc. Unsurprisingly, we are interested in sets of property-preserving transformations that form a group. For instance, the group that preserves orthogonality is the orthogonal group $\operatorname{O}(s)$ (rotations and reflections), and the group that additionally preserves orientation or 'handedness' is $\operatorname{SO}(s)$ (pure rotations). Thus, in general we have a group $\mathfrak{G}$ called the structure group of the bundle, and a gauge transformation is a map $\mathfrak{g} : \Omega \rightarrow \mathfrak{G}$. A key observation is that, in all these cases, for any two frames with the given property at a given point there exists exactly one gauge transformation relating them.

As mentioned before, gauge theory extends beyond tangent bundles, and in general we can consider a bundle of vector spaces whose structure and dimensions are not necessarily related to those of the base space $\Omega$. (We use $s$ to denote the dimension of the base space $\Omega$ and $d$ for the dimension of the fibre. For tangent bundles, $d = s$ is the dimension of the underlying manifold; for an RGB image, $s = 2$ and $d = 3$.) For instance, a color image pixel has a position $u \in \Omega = \mathbb{Z}^2$ on a 2D grid and a value $\mathbf{x}(u) \in \mathbb{R}^3$ in the RGB space, so the space of pixels can be viewed as a vector bundle with base space $\mathbb{Z}^2$ and a fibre $\mathbb{R}^3$ attached at each point. It is customary to express an RGB image relative to a gauge that has basis vectors for R, G, and B (in that order), so that the coordinate representation of the image looks like $\mathbf{x}(u) = (r(u), g(u), b(u))^\top$. But we may equally well permute the basis vectors (color channels) independently at each position, as long as we remember the frame (order of channels) in use at each point. (In this example we have chosen the structure group of the bundle to be $\mathfrak{G} = \Sigma_3$, the permutations of the three color channels; other choices, such as a hue rotation $\mathfrak{G} = \operatorname{SO}(2)$, are also possible.) As a computational operation this is rather pointless, but as we will see shortly, it is conceptually useful to think about gauge transformations for the space of RGB colors, because it allows us to express a gauge symmetry (in this case, an equivalence between colors) and make functions defined on images respect this symmetry (treating each color equivalently).
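To make the bundle picture tangible, the following NumPy sketch (image size and random seed are illustrative assumptions) applies an independent channel permutation at every pixel, stores the per-pixel frame, and recovers the original coordinates; the underlying image is never lost:

```python
import numpy as np

# An RGB image as a vector bundle with fibre R^3: permuting channels
# independently at each pixel changes the numeric array but not the
# underlying image, provided the per-pixel frame is remembered.
rng = np.random.default_rng(2)
h, w = 4, 5
image = rng.random((h, w, 3))            # coordinates in the standard RGB gauge

# a gauge transformation: one permutation of (R, G, B) per pixel
gauges = rng.permuted(np.tile(np.arange(3), (h, w, 1)), axis=-1)
twisted = np.take_along_axis(image, gauges, axis=-1)   # new coordinates

# inverting the stored gauge recovers the original coordinates
inverse = np.argsort(gauges, axis=-1)    # argsort of a permutation inverts it
recovered = np.take_along_axis(twisted, inverse, axis=-1)
print(np.allclose(recovered, image))     # True
```

Here `gauges` plays the role of the frame field: without it, `twisted` is meaningless noise; with it, `twisted` and `image` represent the same underlying section of the bundle.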

As in the case of a vector field on a manifold, an RGB gauge transformation changes the numerical representation of an image (permuting the RGB values independently at each pixel) but not the underlying image. In machine learning applications, we are interested in constructing functions $f \in \mathcal{F}(\mathcal{X}(\Omega))$ on such images (e.g. to perform image classification or segmentation), implemented as layers of a neural network. It follows that if, for whatever reason, we were to apply a gauge transformation to our image, we would also need to change the function $f$ (the network layers) so as to preserve its meaning. Consider for simplicity a $1\times 1$ convolution, i.e. a map that takes an RGB pixel $\mathbf{x}(u) \in \mathbb{R}^3$ to a feature vector $\mathbf{y}(u) \in \mathbb{R}^C$. According to our Geometric Deep Learning blueprint, the output is associated with a group representation $\rho_{\textup{out}}$, in this case a $C$-dimensional representation of the structure group $\mathfrak{G} = \Sigma_3$ (RGB channel permutations), and similarly the input is associated with a representation $\rho_{\textup{in}}(\mathfrak{g}) = \mathfrak{g}$. Then, if we apply a gauge transformation to the input, we need to change the linear map ($1\times 1$ convolution) $f : \mathbb{R}^3 \rightarrow \mathbb{R}^C$ to $f' = \rho_{\textup{out}}^{-1}(\mathfrak{g}) \circ f \circ \rho_{\textup{in}}(\mathfrak{g})$, so that the output feature vector $\mathbf{y}(u) = f(\mathbf{x}(u))$ transforms like $\mathbf{y}'(u) = \rho_{\textup{out}}^{-1}(\mathfrak{g}_u)\mathbf{y}(u)$ at every point. Indeed, we verify:

𝐲=f(𝐱)=ρout1(𝔤)f(ρin(𝔤)ρin1(𝔤)𝐱)=ρout1(𝔤)f(𝐱).superscript𝐲superscript𝑓superscript𝐱subscriptsuperscript𝜌1out𝔤𝑓subscript𝜌in𝔤subscriptsuperscript𝜌1in𝔤𝐱subscriptsuperscript𝜌1out𝔤𝑓𝐱\mathbf{y}^{\prime}=f^{\prime}(\mathbf{x}^{\prime})=\rho^{-1}_{\textup{out}}({\mathfrak{g}})f(\rho_{\textup{in}}({\mathfrak{g}})\rho^{-1}_{\textup{in}}({\mathfrak{g}})\mathbf{x})=\rho^{-1}_{\textup{out}}({\mathfrak{g}})f(\mathbf{x}).bold_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_ρ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT out end_POSTSUBSCRIPT ( fraktur_g ) italic_f ( italic_ρ start_POSTSUBSCRIPT in end_POSTSUBSCRIPT ( fraktur_g ) italic_ρ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT in end_POSTSUBSCRIPT ( fraktur_g ) bold_x ) = italic_ρ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT out end_POSTSUBSCRIPT ( fraktur_g ) italic_f ( bold_x ) .

Gauge Symmetries

To say that we consider gauge transformations to be symmetries is to say that any two gauges related by a gauge transformation are to be considered equivalent. For instance, if we take $\mathfrak{G} = \operatorname{SO}(d)$, any two right-handed orthogonal frames are considered equivalent, because we can map any such frame to any other by a rotation; in other words, there are no distinguished local directions such as 'up' or 'right'. Similarly, if $\mathfrak{G} = \operatorname{O}(d)$ (the orthogonal group), then any left- and right-handed orthogonal frames are considered equivalent: in this case, there is no preferred orientation either. In general, we can consider a group $\mathfrak{G}$ and a collection of frames at every point $u$ such that for any two of them there is a unique $\mathfrak{g}(u) \in \mathfrak{G}$ that maps one frame onto the other.

Regarding gauge transformations as symmetries in our Geometric Deep Learning blueprint, we are interested in making the functions $f$, which act on signals defined on $\Omega$ and expressed with respect to a gauge, equivariant to such transformations. Concretely, this means that if we apply a gauge transformation to the input, the output should undergo the same transformation (perhaps acting via a different representation of $\mathfrak{G}$). We noted before that when we change the gauge, the function $f$ should be changed as well, but for a gauge-equivariant map this is not the case: changing the gauge leaves the map unchanged. To see this, consider again the RGB color space example. The map $f : \mathbb{R}^3 \rightarrow \mathbb{R}^C$ is equivariant if $f \circ \rho_{\textup{in}}(\mathfrak{g}) = \rho_{\textup{out}}(\mathfrak{g}) \circ f$, but in this case the gauge transformation applied to $f$ has no effect: $\rho_{\textup{out}}^{-1}(\mathfrak{g}) \circ f \circ \rho_{\textup{in}}(\mathfrak{g}) = f$. In other words, the coordinate expression of a gauge-equivariant map is independent of the gauge, in the same way that, in the case of graphs, we applied the same function regardless of how the input nodes were permuted. However, unlike the case of graphs and other examples covered so far, gauge transformations act not on $\Omega$ but separately on each of the feature vectors $\mathbf{x}(u)$, by a transformation $\mathfrak{g}(u) \in \mathfrak{G}$ for each $u \in \Omega$.

Further considerations enter the picture when we look at filters on manifolds with a larger spatial support. Let us first consider the easy example of a mapping $f : \mathcal{X}(\Omega, \mathbb{R}) \rightarrow \mathcal{X}(\Omega, \mathbb{R})$ from scalar fields to scalar fields on an $s$-dimensional manifold $\Omega$. Unlike vectors and other geometric quantities, scalars do not have an orientation, so a scalar field $x \in \mathcal{X}(\Omega, \mathbb{R})$ is invariant to gauge transformations (it transforms according to the trivial representation $\rho(\mathfrak{g}) = 1$). Hence, any linear map from scalar fields to scalar fields is gauge-equivariant (or invariant, which is the same in this case). For example, we could write $f$ similarly to (19), as a convolution-like operation with a position-dependent filter $\theta : \Omega \times \Omega \rightarrow \mathbb{R}$,

$$(x\star\theta)(u) = \int_\Omega \theta(u,v)\,x(v)\,\mathrm{d}v.\qquad(22)$$

This implies that we have a potentially different filter $\theta_u = \theta(u,\cdot)$ at each point, i.e., no spatial weight sharing; gauge symmetry alone does not provide it.

Consider now the more interesting case of a mapping $f : \mathcal{X}(\Omega, T\Omega) \rightarrow \mathcal{X}(\Omega, T\Omega)$ from vector fields to vector fields. Relative to a gauge, the input and output vector fields $X, Y \in \mathcal{X}(\Omega, T\Omega)$ are vector-valued functions $\mathbf{x}, \mathbf{y} \in \mathcal{X}(\Omega, \mathbb{R}^s)$. A general linear map between such functions can be written using the same equation we used for scalars (22), only replacing the scalar kernel by a matrix-valued one, $\bm{\Theta} : \Omega \times \Omega \rightarrow \mathbb{R}^{s\times s}$. The matrix $\bm{\Theta}(u,v)$ should map tangent vectors in $T_v\Omega$ to tangent vectors in $T_u\Omega$, but these points have different gauges that we may change arbitrarily and independently. That is, the filter would have to satisfy $\bm{\Theta}(u,v) = \rho^{-1}(\mathfrak{g}(u))\bm{\Theta}(u,v)\rho(\mathfrak{g}(v))$ for all $u, v \in \Omega$, where $\rho$ denotes the action of $\mathfrak{G}$ on vectors, given by an $s \times s$ rotation matrix. Since $\mathfrak{g}(u)$ and $\mathfrak{g}(v)$ can be chosen freely, this is an overly strong constraint on the filter: indeed, $\bm{\Theta}$ would have to be zero in this case.

A better approach is to first transport the vectors to a common tangent space by means of the connection, and then impose gauge equivariance w.r.t. a single gauge transformation at one point only. Instead of (22), we can then define the following map between vector fields,

$$(\mathbf{x}\star\bm{\Theta})(u) = \int_\Omega \bm{\Theta}(u,v)\,\rho(\mathfrak{g}_{v\rightarrow u})\,\mathbf{x}(v)\,\mathrm{d}v,\qquad(23)$$

where $\mathfrak{g}_{v\rightarrow u} \in \mathfrak{G}$ denotes the parallel transport from $v$ to $u$ along the geodesic connecting these two points; its representation $\rho(\mathfrak{g}_{v\rightarrow u})$ is an $s \times s$ rotation matrix rotating the vector as it moves between the points. Note that this geodesic is assumed to be unique, which is true only locally, and thus the filter must have local support. Under a gauge transformation, the transport transforms as $\mathfrak{g}_{v\rightarrow u} \mapsto \mathfrak{g}_u^{-1}\mathfrak{g}_{v\rightarrow u}\mathfrak{g}_v$, and the field itself transforms as $\mathbf{x}(v) \mapsto \rho(\mathfrak{g}_v^{-1})\mathbf{x}(v)$. If the filter commutes with the structure group representation, $\bm{\Theta}(u,v)\rho(\mathfrak{g}_u) = \rho(\mathfrak{g}_u)\bm{\Theta}(u,v)$, equation (23) defines a gauge-equivariant convolution, which transforms as

(𝐱𝚯)(u)=ρ1(𝔤u)(𝐱𝚯)(u).superscript𝐱𝚯𝑢superscript𝜌1subscript𝔤𝑢𝐱𝚯𝑢(\mathbf{x}^{\prime}\star\bm{\Theta})(u)=\rho^{-1}({\mathfrak{g}}_{u})(\mathbf{x}\star\bm{\Theta})(u).( bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ⋆ bold_Θ ) ( italic_u ) = italic_ρ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( fraktur_g start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ) ( bold_x ⋆ bold_Θ ) ( italic_u ) .

under the aforementioned transformation.

4.6 Geometric graphs and Meshes

We will conclude our discussion of different geometric domains with geometric graphs (i.e., graphs that can be realised in some geometric space) and meshes. In our '5G' of geometric domains, meshes fall somewhere between graphs and manifolds: in many regards they are similar to graphs, but their additional structure allows us to also treat them similarly to continuous objects. For this reason, we do not consider meshes as standalone objects in our scheme, and in fact will emphasise that many of the constructions we derive in this section for meshes are directly applicable to general graphs as well.

As we already mentioned in Section 4.4, two-dimensional manifolds (surfaces) are a common way of modelling 3D objects (or, better said, the boundary surfaces of such objects). In computer graphics and vision applications, such surfaces are often discretised as triangular meshes (examples of the topological structures known as simplicial complexes), which can be roughly thought of as piecewise-planar approximations of a surface obtained by gluing triangles together along their edges. Meshes are thus (undirected) graphs with additional structure: in addition to nodes and edges, a mesh $\mathcal{T} = (\mathcal{V}, \mathcal{E}, \mathcal{F})$ also has ordered triplets of nodes forming triangular faces $\mathcal{F} = \{(u,v,q) : u,v,q \in \mathcal{V} \text{ and } (u,v), (u,q), (q,v) \in \mathcal{E}\}$; the order of the nodes defines the face orientation. (Figure: examples of manifold (top) and non-manifold (bottom) edges and nodes. For manifolds with boundary, one further defines boundary edges that belong to exactly one triangle.)

It is further assumed that each edge is shared by exactly two triangles, and that the boundary of all triangles incident on each node forms a single loop of edges. This condition guarantees that 1-hop neighbourhoods around each node are disk-like, and the mesh thus constitutes a discrete manifold; such meshes are referred to as manifold meshes. Similarly to Riemannian manifolds, we can define a metric on the mesh. In the simplest instance, it can be induced from the embedding of the mesh nodes $\mathbf{x}_1, \ldots, \mathbf{x}_n$ and expressed through the Euclidean length of the edges, $\ell_{uv} = \|\mathbf{x}_u - \mathbf{x}_v\|$. A metric defined in this way automatically satisfies properties such as the triangle inequality, i.e., expressions of the form $\ell_{uv} \leq \ell_{uq} + \ell_{vq}$ for any $(u,v,q) \in \mathcal{F}$ and any combination of edges. Any property that can be expressed solely in terms of $\ell$ is intrinsic, and any deformation of the mesh preserving $\ell$ is an isometry; these notions are already familiar to the reader from our discussion in Section 4.4.
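The induced metric is straightforward to compute in practice. The sketch below (the tetrahedron embedding is an illustrative choice) derives edge lengths $\ell_{uv}$ from node coordinates and checks the triangle inequality on every face:

```python
import numpy as np

# A metric on a mesh induced from the node embedding: l_uv = ||x_u - x_v||.
# Illustrative example: a tetrahedron with 4 nodes and 4 triangular faces.
nodes = np.array([[0., 0., 0.],
                  [1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.]])
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

def length(u, v):
    """Euclidean edge length induced by the embedding."""
    return np.linalg.norm(nodes[u] - nodes[v])

# l_uv <= l_uq + l_vq holds automatically for every face (u, v, q)
ok = all(length(u, v) <= length(u, q) + length(v, q) for u, v, q in faces)
print(ok)  # True
```

Any quantity computed only from `length` (shortest paths, areas via Heron's formula, etc.) is intrinsic in the sense described above, and is preserved under any isometric deformation of the node positions.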

Laplacian matrices

By analogy to our treatment of graphs, let us assume a (manifold) mesh with $n$ nodes, each associated with a $d$-dimensional feature vector, which we can arrange (assuming some arbitrary ordering) into an $n \times d$ matrix $\mathbf{X}$. The features can represent the geometric coordinates of the nodes as well as additional properties such as colors, normals, etc., or, in applications such as chemistry where geometric graphs model molecules, properties such as the atomic number.

Let us first look at the spectral convolution (17) on meshes which, as we remind the reader, arises from the Laplacian operator. Considering the mesh as a discretisation of an underlying continuous surface, we can discretise the Laplacian as

\[
(\bm{\Delta}\mathbf{X})_u = \sum_{v \in \mathcal{N}_u} w_{uv}(\mathbf{x}_u - \mathbf{x}_v), \tag{24}
\]

or, in matrix-vector notation, as an $n \times n$ symmetric matrix $\bm{\Delta} = \mathbf{D} - \mathbf{W}$, where $\mathbf{D} = \mathrm{diag}(d_1, \ldots, d_n)$ is called the degree matrix and $d_u = \sum_v w_{uv}$ the degree of node $u$. It is easy to see that equation (24) performs local permutation-invariant aggregation of neighbour features, $\phi(\mathbf{x}_u, \mathbf{X}_{\mathcal{N}_u}) = d_u \mathbf{x}_u - \sum_{v \in \mathcal{N}_u} w_{uv}\mathbf{x}_v$, and $\mathbf{F}(\mathbf{X}) = \bm{\Delta}\mathbf{X}$ is in fact an instance of our general blueprint (13) for constructing permutation-equivariant functions on graphs.
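To make this concrete, here is a minimal numpy sketch (the names are our own) that builds the Laplacian $\bm{\Delta} = \mathbf{D} - \mathbf{W}$ from a symmetric weight matrix and checks the permutation equivariance of $\mathbf{F}(\mathbf{X}) = \bm{\Delta}\mathbf{X}$, i.e., that relabelling the nodes and then applying the (relabelled) Laplacian matches applying the original Laplacian and then relabelling:

```python
import numpy as np

def laplacian(W):
    """Combinatorial Laplacian Delta = D - W of a weighted undirected graph."""
    D = np.diag(W.sum(axis=1))          # degree matrix, d_u = sum_v w_uv
    return D - W

rng = np.random.default_rng(0)
n, d = 6, 3
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)  # symmetric, no self-loops
X = rng.random((n, d))                  # node feature matrix

L = laplacian(W)
P = np.eye(n)[rng.permutation(n)]       # random permutation matrix

# permutation equivariance: P (Delta X) equals (P Delta P^T)(P X)
lhs = P @ (L @ X)
rhs = laplacian(P @ W @ P.T) @ (P @ X)
assert np.allclose(lhs, rhs)
```

The same check works for any local aggregation built from the blueprint (13), since the permutation acts identically on rows of $\mathbf{X}$ and on both index sets of $\mathbf{W}$.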

Note that so far there is nothing specific to meshes in our definition of the Laplacian in (24); in fact, this construction is valid for arbitrary graphs as well, with edge weights identified with the adjacency matrix, $\mathbf{W} = \mathbf{A}$, i.e., $w_{uv} = 1$ if $(u,v) \in \mathcal{E}$ and zero otherwise.\footnote{The degree in this case equals the number of neighbours.} Laplacians constructed in this way are often called combinatorial, to reflect the fact that they merely capture the connectivity structure of the graph.\footnote{If the graph is directed, the corresponding Laplacian is non-symmetric.} For geometric graphs (which do not necessarily have the additional structure of meshes, but whose nodes do have spatial coordinates that induce a metric in the form of edge lengths), it is common to use weights inversely related to the metric, e.g. $w_{uv} \propto e^{-\ell_{uv}}$.

On meshes, we can exploit the additional structure afforded by the faces, and define the edge weights in equation (24) using the cotangent formula (Pinkall and Polthier, 1993; Meyer et al., 2003)\footnote{The earliest use of this formula dates back to the PhD thesis of MacNeal (1949), who developed it to solve PDEs on the Caltech Electric Analog Computer.}

\[
w_{uv} = \frac{\cot\angle_{uqv} + \cot\angle_{upv}}{2a_u} \tag{25}
\]

where $\angle_{uqv}$ and $\angle_{upv}$ are the two angles in the triangles $(u,q,v)$ and $(u,p,v)$ opposite the shared edge $(u,v)$, and $a_u$ is the local area element, typically computed as the area of the polygon constructed upon the barycentres of the triangles sharing the node $u$ and given by $a_u = \frac{1}{3}\sum_{v,q\,:\,(u,v,q)\in\mathcal{F}} a_{uvq}$.

The cotangent Laplacian can be shown to have multiple convenient properties (see e.g. Wardetzky et al. (2007)): it is a positive-semidefinite matrix, $\bm{\Delta} \succcurlyeq 0$, and thus has non-negative eigenvalues $\lambda_1 \leq \ldots \leq \lambda_n$ that can be regarded as an analogy of frequency; it is symmetric and thus has orthogonal eigenvectors; and it is local (i.e., the value of $(\bm{\Delta}\mathbf{X})_u$ depends only on the 1-hop neighbours $\mathcal{N}_u$). Perhaps the most important property is the convergence of the cotangent mesh Laplacian matrix $\bm{\Delta}$ to the continuous operator $\Delta$ when the mesh is infinitely refined (Wardetzky, 2008). Equation (25) thus constitutes an appropriate discretisation\footnote{Some technical conditions must be imposed on the refinement, to avoid e.g. triangles becoming pathological. One such example is a bizarre triangulation of the cylinder known in German as the Schwarzscher Stiefel (Schwarz's boot) or in the English literature as the `Schwarz lantern', proposed in 1880 by Hermann Schwarz, the German mathematician of Cauchy–Schwarz inequality fame.} of the Laplacian operator defined on Riemannian manifolds in Section 4.4.

While one expects the Laplacian to be intrinsic, this is not immediately obvious from equation (25), and it takes some effort to express the cotangent weights entirely in terms of the discrete metric $\ell$ as

\[
w_{uv} = \frac{-\ell^2_{uv} + \ell^2_{vq} + \ell^2_{uq}}{8a_{uvq}} + \frac{-\ell^2_{uv} + \ell^2_{vp} + \ell^2_{up}}{8a_{uvp}}
\]

where the area $a_{uvq}$ of the triangle $(u,v,q)$ is given as

\[
a_{uvq} = \sqrt{s_{uvq}(s_{uvq} - \ell_{uv})(s_{uvq} - \ell_{vq})(s_{uvq} - \ell_{uq})}
\]

using Heron's semiperimeter formula with $s_{uvq} = \tfrac{1}{2}(\ell_{uv} + \ell_{uq} + \ell_{vq})$. This endows the Laplacian (and any quantities associated with it, such as its eigenvectors and eigenvalues) with isometry invariance, a property for which it is so loved in geometry processing and computer graphics (see an excellent review by Wang and Solomon (2019)): any deformation of the mesh that does not affect the metric $\ell$ (i.e., does not `stretch' or `squeeze' the edges of the mesh) does not change the Laplacian.
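As a sanity check of the two expressions, the following sketch (our own; function names hypothetical) computes the unnormalised cotangent weight $\tfrac{1}{2}(\cot\angle_{uqv} + \cot\angle_{upv})$ of an edge both extrinsically, from the vertex coordinates of the two incident triangles, and intrinsically, purely from edge lengths via Heron's formula, and confirms that the two agree:

```python
import numpy as np

def cot_weight_extrinsic(xu, xv, xq, xp):
    """(cot angle_uqv + cot angle_upv)/2 from the coordinates of the two
    triangles (u,q,v) and (u,p,v) sharing the edge (u,v)."""
    def cot_at(a, b, c):                # cotangent of the angle at vertex a in triangle (a,b,c)
        e1, e2 = b - a, c - a
        return np.dot(e1, e2) / np.linalg.norm(np.cross(e1, e2))
    return 0.5 * (cot_at(xq, xu, xv) + cot_at(xp, xu, xv))

def cot_weight_intrinsic(l_uv, l_uq, l_vq, l_up, l_vp):
    """The same weight expressed through the discrete metric only."""
    def area(a, b, c):                  # Heron's formula
        s = 0.5 * (a + b + c)
        return np.sqrt(s * (s - a) * (s - b) * (s - c))
    w  = (-l_uv**2 + l_vq**2 + l_uq**2) / (8 * area(l_uv, l_vq, l_uq))
    w += (-l_uv**2 + l_vp**2 + l_up**2) / (8 * area(l_uv, l_vp, l_up))
    return w

# two non-degenerate triangles sharing the edge (u,v)
xu, xv = np.array([0., 0., 0.]), np.array([1., 0., 0.])
xq, xp = np.array([0.3, 0.8, 0.]), np.array([0.6, -0.7, 0.2])
l = lambda a, b: np.linalg.norm(a - b)
w1 = cot_weight_extrinsic(xu, xv, xq, xp)
w2 = cot_weight_intrinsic(l(xu, xv), l(xu, xq), l(xv, xq), l(xu, xp), l(xv, xp))
assert np.isclose(w1, w2)
```

The equality follows from the identity $\cot\angle_{uqv} = (-\ell^2_{uv} + \ell^2_{uq} + \ell^2_{vq})/(4a_{uvq})$, obtained by combining the law of cosines with the area formula $a_{uvq} = \tfrac{1}{2}\ell_{uq}\ell_{vq}\sin\angle_{uqv}$.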

Finally, as we already noticed,\footnote{Laplacian-based filters are isotropic; in the plane, such filters have radial symmetry.} the definition of the Laplacian (25) is invariant to the permutation of nodes in $\mathcal{N}_u$, as it involves aggregation in the form of summation. While on general graphs this is a necessary evil due to the lack of a canonical ordering of neighbours, on meshes we can order the 1-hop neighbours according to some orientation (e.g., clockwise), and the only ambiguity is the selection of the first node. Thus, instead of all possible permutations we need only account for cyclic shifts (rotations), which intuitively corresponds to the ambiguity arising from $\mathrm{SO}(2)$ gauge transformations discussed in Section 4.5. For a fixed gauge, it is possible to define an anisotropic Laplacian that is sensitive to local directions, which amounts to changing the metric or the weights $w_{uv}$. Constructions of this kind were used to design shape descriptors by Andreux et al. (2014); Boscaini et al. (2016b), and in early Geometric Deep Learning architectures on meshes by Boscaini et al. (2016a).

Spectral analysis on meshes

The orthogonal eigenvectors $\bm{\Phi} = (\bm{\varphi}_1, \ldots, \bm{\varphi}_n)$ diagonalising the Laplacian matrix ($\bm{\Delta} = \bm{\Phi}\bm{\Lambda}\bm{\Phi}^\top$, where $\bm{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of Laplacian eigenvalues) are used as the non-Euclidean analogy of the Fourier basis, allowing us to perform spectral convolution on the mesh as the product of the respective Fourier transforms,

\[
\mathbf{X}\star\bm{\theta} = \bm{\Phi}\,\mathrm{diag}(\bm{\Phi}^\top\bm{\theta})(\bm{\Phi}^\top\mathbf{X}) = \bm{\Phi}\,\mathrm{diag}(\hat{\bm{\theta}})\hat{\mathbf{X}},
\]

where the filter $\hat{\bm{\theta}}$ is designed directly in the Fourier domain. Again, nothing in this formula is specific to meshes, and one can use the Laplacian matrix of a generic (undirected) graph.\footnote{The fact that the graph is assumed to be undirected is important: in this case the Laplacian is symmetric and has orthogonal eigenvectors.} It is tempting to exploit this spectral definition of convolution to generalise CNNs to graphs, which in fact was done by one of the authors of this text (Bruna et al., 2013). However, it appears that the non-Euclidean Fourier transform is extremely sensitive to even minor perturbations of the underlying mesh or graph (see Figure 12 in Section 4.4), and thus can only be used when one deals with different signals on a fixed domain, but not when one wishes to generalise across different domains. Unfortunately, many computer graphics and vision problems fall into the latter category, where one trains a neural network on one set of 3D shapes (meshes) and tests on a different set, making the Fourier transform-based approach inappropriate.

As noted in Section 4.4, it is preferable to use spectral filters of the form (18), applying some transfer function $\hat{p}(\lambda)$ to the Laplacian matrix,

\[
\hat{p}(\bm{\Delta})\mathbf{X} = \bm{\Phi}\hat{p}(\bm{\Lambda})\bm{\Phi}^\top\mathbf{X} = \bm{\Phi}\,\mathrm{diag}(\hat{p}(\lambda_1), \ldots, \hat{p}(\lambda_n))\hat{\mathbf{X}}.
\]

When $\hat{p}$ can be expressed in terms of matrix-vector products, the eigendecomposition of the $n \times n$ matrix $\bm{\Delta}$\footnote{In the general case, the complexity of eigendecomposition is $\mathcal{O}(n^3)$.} can be avoided altogether. For example, Defferrard et al. (2016) used polynomials of degree $r$ as filter functions,

\[
\hat{p}(\bm{\Delta})\mathbf{X} = \sum_{k=0}^{r}\alpha_k\bm{\Delta}^k\mathbf{X} = \alpha_0\mathbf{X} + \alpha_1\bm{\Delta}\mathbf{X} + \ldots + \alpha_r\bm{\Delta}^r\mathbf{X},
\]

amounting to the multiplication of the $n \times d$ feature matrix $\mathbf{X}$ by the $n \times n$ Laplacian matrix $r$ times. Since the Laplacian is typically sparse (with $\mathcal{O}(|\mathcal{E}|)$ non-zero elements),\footnote{Meshes are nearly-regular graphs, with each node having $\mathcal{O}(1)$ neighbours, resulting in $\mathcal{O}(n)$ non-zeros in $\bm{\Delta}$.} this operation has low complexity of $\mathcal{O}(|\mathcal{E}|dr) \sim \mathcal{O}(|\mathcal{E}|)$. Furthermore, since the Laplacian is local, a polynomial filter of degree $r$ is localised in the $r$-hop neighbourhood.
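A small numpy sketch of ours (using a dense Laplacian for brevity; in practice $\bm{\Delta}$ would be stored as a sparse matrix so that each product costs $\mathcal{O}(|\mathcal{E}|d)$) applies such a polynomial filter using only matrix-vector products, and verifies it against the eigendecomposition-based computation it avoids:

```python
import numpy as np

def poly_filter(L, X, alphas):
    """Apply p(L) X = sum_k alpha_k L^k X using only mat-vec products,
    without ever eigendecomposing L."""
    out = np.zeros_like(X)
    Z = X.copy()                         # Z holds L^k X, starting with k = 0
    for a in alphas:
        out += a * Z
        Z = L @ Z
    return out

rng = np.random.default_rng(1)
n, d = 8, 2
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(1)) - W                # combinatorial Laplacian
X = rng.random((n, d))
alphas = [0.5, -0.2, 0.05]               # p(lam) = 0.5 - 0.2 lam + 0.05 lam^2

# reference: the same filter computed in the spectral domain
lam, Phi = np.linalg.eigh(L)
ref = Phi @ np.diag(sum(a * lam**k for k, a in enumerate(alphas))) @ Phi.T @ X

assert np.allclose(poly_filter(L, X, alphas), ref)
```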

However, this exact property comes at a disadvantage when dealing with meshes, since the actual support of the filter (i.e., the radius it covers) depends on the resolution of the mesh. One has to bear in mind that meshes arise from the discretisation of some underlying continuous surface, and one may have two different meshes $\mathcal{T}$ and $\mathcal{T}'$ representing the same object.\footnote{[Marginal figure: two-hop neighbourhoods on meshes of different resolution.]} In a finer mesh, one might have to use larger neighbourhoods (thus, a larger degree $r$ of the filter) than in a coarser one.

For this reason, in computer graphics applications it is more common to use rational filters, since they are resolution-independent. There are many ways to define such filters (see, e.g. Patanè (2020)), the most common being as a polynomial of some rational function, e.g., $\frac{\lambda-1}{\lambda+1}$. More generally, one can use a complex function, such as the Cayley transform $\frac{\lambda-\mathrm{i}}{\lambda+\mathrm{i}}$ that maps the real line into the unit circle in the complex plane.\footnote{The Cayley transform is a particular case of a Möbius transformation. When applied to the Laplacian (a positive-semidefinite matrix), it maps its non-negative eigenvalues to the complex half-circle.} Levie et al. (2018) used spectral filters expressed as Cayley polynomials, real rational functions with complex coefficients $\alpha_l \in \mathbb{C}$,

\[
\hat{p}(\lambda) = \mathrm{Re}\left(\sum_{l=0}^{r}\alpha_l\left(\frac{\lambda-\mathrm{i}}{\lambda+\mathrm{i}}\right)^{l}\right).
\]

When applied to matrices, the computation of the Cayley polynomial requires matrix inversion,

\[
\hat{p}(\bm{\Delta}) = \mathrm{Re}\left(\sum_{l=0}^{r}\alpha_l(\bm{\Delta}-\mathrm{i}\mathbf{I})^{l}(\bm{\Delta}+\mathrm{i}\mathbf{I})^{-l}\right),
\]

which can be carried out approximately with linear complexity. Unlike polynomial filters, rational filters do not have local support, but exhibit exponential decay (Levie et al., 2018). A crucial difference compared to the direct computation of the Fourier transform is that polynomial and rational filters are stable under approximate isometric deformations of the underlying graph or mesh – various results of this kind were shown e.g. by Levie et al. (2018, 2019); Gama et al. (2020); Kenlay et al. (2021).
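To illustrate, the following dense-matrix sketch (our own; Levie et al. employ approximate iterative solvers rather than the exact solves used here) applies a Cayley polynomial filter via repeated linear solves, and checks the result against directly evaluating the same rational function on the eigenvalues of the Laplacian:

```python
import numpy as np

def cayley_filter(L, X, alphas):
    """p(L) X = Re( sum_l alpha_l (L - iI)^l (L + iI)^{-l} ) X via linear solves."""
    I = np.eye(L.shape[0])
    Z = X.astype(complex)               # Z holds ((L - iI)(L + iI)^{-1})^l X
    out = alphas[0] * Z                 # l = 0 term
    for a in alphas[1:]:
        Z = (L - 1j * I) @ np.linalg.solve(L + 1j * I, Z)
        out += a * Z
    return out.real

rng = np.random.default_rng(2)
n = 7
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(1)) - W               # positive-semidefinite Laplacian
X = rng.random((n, 1))
alphas = [0.3 + 0.1j, -0.2 + 0.4j, 0.05 - 0.2j]

# reference: evaluate the Cayley polynomial on the eigenvalues of L
lam, Phi = np.linalg.eigh(L)
p_lam = sum(a * ((lam - 1j) / (lam + 1j))**l for l, a in enumerate(alphas)).real
ref = Phi @ np.diag(p_lam) @ Phi.T @ X

assert np.allclose(cayley_filter(L, X, alphas), ref)
```

Note that $\bm{\Delta} + \mathrm{i}\mathbf{I}$ is always invertible, since the eigenvalues of the Laplacian are real.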

Meshes as operators and Functional maps

The paradigm of functional maps suggests thinking of meshes as operators. As we will show, this allows obtaining more interesting types of invariance, exploiting the additional structure of meshes. For the purpose of our discussion, assume the mesh $\mathcal{T}$ is constructed upon embedded nodes with coordinates $\mathbf{X}$. If we construct an intrinsic operator like the Laplacian, it can be shown that it completely encodes the structure of the mesh, and one can recover the mesh from it (up to its isometric embedding, as shown by Zeng et al. (2012)). This is also true for some other operators (see e.g. Boscaini et al. (2015); Corman et al. (2017); Chern et al. (2018)), so we will assume a general operator, an $n \times n$ matrix $\mathbf{Q}(\mathcal{T}, \mathbf{X})$, as a representation of our mesh.

In this view, the discussion of Section 4.1 of learning functions of the form $f(\mathbf{X}, \mathcal{T})$ can be rephrased as learning functions of the form $f(\mathbf{Q})$. Similar to graphs and sets, the nodes of meshes have no canonical ordering, i.e., functions on meshes must satisfy the permutation invariance or equivariance conditions,

\[
f(\mathbf{Q}) = f(\mathbf{P}\mathbf{Q}\mathbf{P}^\top), \qquad
\mathbf{P}\mathbf{F}(\mathbf{Q}) = \mathbf{F}(\mathbf{P}\mathbf{Q}\mathbf{P}^\top)
\]

for any permutation matrix $\mathbf{P}$. However, compared to general graphs, we now have more structure: we can assume that our mesh arises from the discretisation of some underlying continuous surface $\Omega$. It is thus possible to have a different mesh $\mathcal{T}' = (\mathcal{V}', \mathcal{E}', \mathcal{F}')$ with $n'$ nodes and coordinates $\mathbf{X}'$ representing the same object $\Omega$ as $\mathcal{T}$. Importantly, the meshes $\mathcal{T}$ and $\mathcal{T}'$ can have a different connectivity structure and even a different number of nodes ($n' \neq n$). Therefore, we cannot think of these meshes as isomorphic graphs with a mere reordering of nodes, and cannot consider a permutation matrix $\mathbf{P}$ as the correspondence between them.

Functional maps were introduced by Ovsjanikov et al. (2012) as a generalisation of the notion of correspondence to such settings, replacing the correspondence between points on two domains (a map $\eta:\Omega\rightarrow\Omega'$) with a correspondence between functions (a map $\mathbf{C}:\mathcal{X}(\Omega)\rightarrow\mathcal{X}(\Omega')$, see Figure 13). A functional map is a linear operator $\mathbf{C}$, represented as an $n' \times n$ matrix, establishing correspondence between signals $\mathbf{x}'$ and $\mathbf{x}$ on the respective domains as

\[
\mathbf{x}' = \mathbf{C}\mathbf{x}.
\]

Rustamov et al. (2013) showed that in order to guarantee an area-preserving mapping, the functional map must be orthogonal, $\mathbf{C}^\top\mathbf{C} = \mathbf{I}$, i.e., be an element of the orthogonal group $\mathbf{C} \in \mathrm{O}(n)$. In this case, we can invert the map using $\mathbf{C}^{-1} = \mathbf{C}^\top$.

Figure 13: Pointwise map (left) vs functional map (right).

The functional map also establishes a relation between the operator representation of meshes,

\[
\mathbf{Q}' = \mathbf{C}\mathbf{Q}\mathbf{C}^\top, \qquad \mathbf{Q} = \mathbf{C}^\top\mathbf{Q}'\mathbf{C},
\]

which we can interpret as follows: given an operator representation $\mathbf{Q}$ of $\mathcal{T}$ and a functional map $\mathbf{C}$, we can construct the representation $\mathbf{Q}'$ of $\mathcal{T}'$ by first mapping the signal from $\mathcal{T}'$ to $\mathcal{T}$ (using $\mathbf{C}^\top$), applying the operator $\mathbf{Q}$, and then mapping back to $\mathcal{T}'$ (using $\mathbf{C}$).\footnote{Note that we read these operations right-to-left.} This leads us to a more general class of remeshing-invariant (or -equivariant) functions on meshes, satisfying

\[
f(\mathbf{Q}) = f(\mathbf{C}\mathbf{Q}\mathbf{C}^\top) = f(\mathbf{Q}'), \qquad
\mathbf{C}\mathbf{F}(\mathbf{Q}) = \mathbf{F}(\mathbf{C}\mathbf{Q}\mathbf{C}^\top) = \mathbf{F}(\mathbf{Q}')
\]

for any $\mathbf{C} \in \mathrm{O}(n)$. It is easy to see that the previous setting of permutation invariance and equivariance is a particular case,\footnote{This follows from the orthogonality of permutation matrices, $\mathbf{P}^\top\mathbf{P} = \mathbf{I}$.} which can be thought of as a trivial remeshing in which only the order of the nodes is changed.

Wang et al. (2019a) showed that, given an eigendecomposition of the operator $\mathbf{Q} = \mathbf{V}\bm{\Lambda}\mathbf{V}^\top$, any remeshing-invariant (or -equivariant) function can be expressed as $f(\mathbf{Q}) = f(\bm{\Lambda})$ and $\mathbf{F}(\mathbf{Q}) = \mathbf{V}\mathbf{F}(\bm{\Lambda})$; in other words, remeshing-invariant functions involve only the spectrum of $\mathbf{Q}$. Indeed, functions of Laplacian eigenvalues have proven in practice to be robust to surface discretisation and perturbation, explaining the popularity of spectral constructions based on Laplacians in computer graphics, as well as in deep learning on graphs (Defferrard et al., 2016; Levie et al., 2018). Since this result refers to a generic operator $\mathbf{Q}$, multiple choices are available besides the ubiquitous Laplacian – notable examples include the Dirac (Liu et al., 2017; Kostrikov et al., 2018) or Steklov (Wang et al., 2018) operators, as well as learnable parametric operators (Wang et al., 2019a).
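A quick numerical illustration (a sketch of ours, not from the text): conjugating a symmetric operator by any orthogonal matrix – a random orthogonal $\mathbf{C}$ or a plain permutation $\mathbf{P}$ – leaves its spectrum unchanged, so any function of the eigenvalues alone is remeshing invariant in the above sense:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
Q = rng.random((n, n)); Q = (Q + Q.T) / 2          # symmetric operator representing the mesh

# two orthogonal 'remeshings': a random orthogonal matrix and a permutation
C, _ = np.linalg.qr(rng.random((n, n)))            # C^T C = I
P = np.eye(n)[rng.permutation(n)]                  # permutations are a special case

def f(Q):
    """An invariant function of the operator: depends only on its eigenvalues."""
    return np.linalg.eigvalsh(Q).sum()             # e.g. the trace, computed via the spectrum

for C_ in (C, P):
    Q_prime = C_ @ Q @ C_.T                        # operator of the 'remeshed' domain
    assert np.isclose(f(Q), f(Q_prime))
```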

5 Geometric Deep Learning Models

Having thoroughly studied various instantiations of our Geometric Deep Learning blueprint (for different choices of domain, symmetry group, and notions of locality), we are ready to discuss how enforcing these prescriptions can yield some of the most popular deep learning architectures.

Our exposition, once again, will not be in strict order of generality. We initially cover three architectures for which the implementation follows nearly-directly from our preceding discussion: convolutional neural networks (CNNs), group-equivariant CNNs, and graph neural networks (GNNs).

We will then take a closer look into variants of GNNs for cases where a graph structure is not known upfront (i.e. unordered sets), and through our discussion we will describe the popular Deep Sets and Transformer architectures as instances of GNNs.

Following our discussion on geometric graphs and meshes, we first describe equivariant message passing networks, which introduce explicit geometric symmetries into GNN computations. Then, we show ways in which our theory of geodesics and gauge symmetries can be materialised within deep learning, recovering a family of intrinsic mesh CNNs (including Geodesic CNNs, MoNet and gauge-equivariant mesh CNNs).

Lastly, we look back on the grid domain from a temporal angle. This discussion will lead us to recurrent neural networks (RNNs). We will demonstrate a manner in which RNNs are translation equivariant over temporal grids, but also study their stability to time warping transformations. This property is highly desirable for properly handling long-range dependencies, and enforcing invariance to such transformations yields exactly the class of gated RNNs (including popular RNN models such as the LSTM or GRU).

While we hope the above canvasses most of the key deep learning architectures in use at the time of writing, we are well aware that novel neural network instances are proposed daily. Accordingly, rather than aiming to cover every possible architecture, we hope that the following sections are illustrative enough, to the point that the reader is able to easily categorise any future Geometric Deep Learning developments using the lens of invariances and symmetries.

5.1 Convolutional Neural Networks

Convolutional Neural Networks are perhaps the earliest and most well-known example of deep learning architectures following the blueprint of Geometric Deep Learning outlined in Section 3.5. In Section 4.2 we fully characterised the class of linear, local, translation-equivariant operators, given by convolutions $\mathbf{C}(\bm{\theta})\mathbf{x} = \mathbf{x}\star\bm{\theta}$ with a localised filter $\bm{\theta}$ (recall that $\mathbf{C}(\bm{\theta})$ is a circulant matrix with parameters $\bm{\theta}$). Let us first focus on scalar-valued ('single-channel' or 'grayscale') discretised images, where the domain is the grid $\Omega = [H]\times[W]$ with $\mathbf{u} = (u_1, u_2)$ and $\mathbf{x}\in\mathcal{X}(\Omega,\mathbb{R})$.

Any convolution with a compactly supported filter of size $H^f \times W^f$ can be written as a linear combination of generators $\bm{\theta}_{1,1},\dots,\bm{\theta}_{H^f,W^f}$, given for example by the unit peaks $\bm{\theta}_{vw}(u_1,u_2) = \delta(u_1 - v,\, u_2 - w)$. Any local linear equivariant map is thus expressible as (note that we usually imagine $\mathbf{x}$ and $\bm{\theta}_{vw}$ as 2D matrices, but in the following equation both have their two coordinate dimensions flattened into one, making $\mathbf{x}$ a vector and $\mathbf{C}(\bm{\theta}_{vw})$ a matrix)

$$\mathbf{F}(\mathbf{x}) = \sum_{v=1}^{H^f}\sum_{w=1}^{W^f} \alpha_{vw}\,\mathbf{C}(\bm{\theta}_{vw})\,\mathbf{x}, \qquad (26)$$

which, in coordinates, corresponds to the familiar 2D convolution (see Figure 14 for an overview):

Figure 14: The process of convolving an image $\mathbf{x}$ with a filter $\mathbf{C}(\bm{\theta})$. The filter parameters $\bm{\theta}$ can be expressed as a linear combination of generators $\bm{\theta}_{vw}$.
$$\mathbf{F}(\mathbf{x})_{uv} = \sum_{a=1}^{H^f}\sum_{b=1}^{W^f} \alpha_{ab}\, x_{u+a,\,v+b}. \qquad (27)$$
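In code, equation (27) is just a sliding inner product. Below is a minimal NumPy sketch (our own illustrative implementation, with 'valid' boundary handling; the names `conv2d` and `avg` are not from the text):

```python
import numpy as np

def conv2d(x, alpha):
    """Equation (27): F(x)[u, v] = sum_{a,b} alpha[a, b] * x[u + a, v + b],
    evaluated wherever the filter fits entirely inside the image ('valid')."""
    Hf, Wf = alpha.shape
    H, W = x.shape
    out = np.zeros((H - Hf + 1, W - Wf + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            out[u, v] = np.sum(alpha * x[u:u + Hf, v:v + Wf])
    return out

x = np.arange(16.0).reshape(4, 4)   # a tiny 'grayscale image'
avg = np.full((2, 2), 0.25)         # local-average filter
out = conv2d(x, avg)                # shape (3, 3)
```

Because the same coefficients $\alpha$ are applied at every position $(u,v)$, translating the input (away from boundary effects) translates the output, which is exactly the equivariance characterising this operator class.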

Other choices of the basis $\bm{\theta}_{vw}$ are also possible and will yield equivalent operations (for potentially different choices of $\alpha_{vw}$). A popular example is given by the directional derivatives $\bm{\theta}_{vw}(u_1,u_2) = \delta(u_1,u_2) - \delta(u_1 - v,\, u_2 - w)$ for $(v,w)\neq(0,0)$, taken together with the local average $\bm{\theta}_{0}(u_1,u_2) = \frac{1}{H^f W^f}$. In fact, directional derivatives can be considered a grid-specific analogue of diffusion processes on graphs, which we recover if we assume each pixel to be a node connected to its immediate neighbouring pixels in the grid.

When the scalar input channel is replaced by multiple channels (e.g., RGB colours, or more generally an arbitrary number of feature maps), the convolutional filter becomes a convolutional tensor expressing arbitrary linear combinations of input features into output feature maps. In coordinates, this can be expressed as

$$\mathbf{F}(\mathbf{x})_{uvj} = \sum_{a=1}^{H^f}\sum_{b=1}^{W^f}\sum_{c=1}^{M} \alpha_{jabc}\, x_{u+a,\,v+b,\,c}, \qquad j\in[N], \qquad (28)$$

where M𝑀Mitalic_M and N𝑁Nitalic_N are respectively the number of input and output channels. This basic operation encompasses a broad class of neural network architectures, which, as we will show in the next section, have had a profound impact across many areas of computer vision, signal processing, and beyond. Here, rather than dissecting the myriad of possible architectural variants of CNNs, we prefer to focus on some of the essential innovations that enabled their widespread use.

Efficient multiscale computation

As discussed in the GDL template for general symmetries, extracting translation-invariant features out of the convolutional operator $\mathbf{F}$ requires a non-linear step. (ReLU, often considered a 'modern' architectural choice, was already used in the Neocognitron (Fukushima and Miyake, 1982). Rectification is equivalent to the principle of demodulation, which is fundamental in electrical engineering as the basis for many transmission protocols, such as FM radio, and also has a prominent role in models of neuronal activity.) Convolutional features are processed through a non-linear activation function $\sigma$, acting element-wise on the input, i.e. $\sigma: \mathcal{X}(\Omega)\to\mathcal{X}(\Omega)$ with $\sigma(\mathbf{x})(u) = \sigma(\mathbf{x}(u))$. Perhaps the most popular example at the time of writing is the Rectified Linear Unit (ReLU), $\sigma(x) = \max(x, 0)$. This non-linearity effectively rectifies the signals, pushing their energy towards lower frequencies and enabling the computation of high-order interactions across scales by iterating the construction.

Already in the early works of Fukushima and Miyake (1982) and LeCun et al. (1998), CNNs and similar architectures had a multiscale structure, where after each convolutional layer (28) one performs a grid coarsening $\mathbf{P}: \mathcal{X}(\Omega)\to\mathcal{X}(\Omega')$, where the grid $\Omega'$ has coarser resolution than $\Omega$. This enables multiscale filters with an effectively increasing receptive field, yet a constant number of parameters per scale. Several signal coarsening strategies $\mathbf{P}$ (referred to as pooling) may be used; the most common are applying a low-pass anti-aliasing filter (e.g. a local average) followed by grid downsampling, or non-linear max-pooling.
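The two pooling strategies just mentioned can be sketched as follows (a toy NumPy version assuming non-overlapping $k\times k$ windows; `avg_pool` and `max_pool` are our own names):

```python
import numpy as np

def avg_pool(x, k=2):
    """Anti-aliasing low-pass (mean over k x k blocks) followed by downsampling."""
    H, W = x.shape
    blocks = x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k)
    return blocks.mean(axis=(1, 3))

def max_pool(x, k=2):
    """Non-linear max-pooling over the same non-overlapping k x k blocks."""
    H, W = x.shape
    blocks = x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k)
    return blocks.max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
coarse = avg_pool(x)   # 2 x 2 grid: receptive field doubles, parameter count does not
```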

In summary, a ‘vanilla’ CNN layer can be expressed as the composition of the basic objects already introduced in our Geometric Deep Learning blueprint:

$$\mathbf{h} = \mathbf{P}(\sigma(\mathbf{F}(\mathbf{x}))), \qquad (29)$$

i.e. an equivariant linear layer $\mathbf{F}$, a coarsening operation $\mathbf{P}$, and a non-linearity $\sigma$. It is also possible to perform translation-invariant global pooling operations within CNNs. Intuitively, each pixel (which, after several convolutions, summarises a patch centred around it) proposes a final representation of the image, with the ultimate choice being guided by a form of aggregation of these proposals. (CNNs which only consist of the operations mentioned in this paragraph are often dubbed "all-convolutional". In contrast, many CNNs flatten the image across the spatial axes and pass it to an MLP classifier once sufficient equivariant and coarsening layers have been applied; this loses translation invariance.) A popular choice of aggregation is the average, as its outputs retain similar magnitudes irrespective of the image size (Springenberg et al., 2014).
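A quick numerical illustration of why global average pooling is translation invariant (for cyclic shifts on the grid; a toy check of our own, not a full CNN):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))                      # stand-in for a final feature map

pooled = x.mean()                                # one aggregated 'proposal' per map
shifted = np.roll(x, shift=(3, 2), axis=(0, 1))  # cyclically translated input

# The average over all pixels does not depend on where the pattern sits.
assert np.isclose(pooled, shifted.mean())
```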

Prominent examples following this CNN blueprint (some of which we will discuss next) are displayed in Figure 15.

Figure 15: Prominent examples of CNN architectures. Top-to-bottom: LeNet (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012), ResNet (He et al., 2016) and U-Net (Ronneberger et al., 2015). Drawn using the PlotNeuralNet package (Iqbal, 2018).

Deep and Residual Networks

A CNN architecture, in its simplest form, is therefore specified by hyperparameters $(H^f_k, W^f_k, N_k, p_k)_{k\leq K}$, with $M_{k+1} = N_k$ and $p_k \in \{0, 1\}$ indicating whether grid coarsening is performed. While all these hyperparameters are important in practice, a particularly important question is to understand the role of the depth $K$ in CNN architectures, and the fundamental tradeoffs involved in choosing such a key hyperparameter, especially in relation to the filter sizes $(H^f_k, W^f_k)$.

While a rigorous answer to this question is still elusive, mounting empirical evidence collected over recent years suggests a favourable tradeoff towards deeper (large $K$) yet thinner (small $(H^f_k, W^f_k)$) models. (Historically, ResNet models are predated by highway networks (Srivastava et al., 2015), which allow for more general gating mechanisms to control the residual information flow.) In this context, a crucial insight by He et al. (2016) was to reparametrise each convolutional layer to model a perturbation of the previous features, rather than a generic non-linear transformation:

$$\mathbf{h} = \mathbf{P}\left(\mathbf{x} + \sigma(\mathbf{F}(\mathbf{x}))\right). \qquad (30)$$

The resulting residual networks provide several key advantages over the previous formulation. In essence, the residual parametrisation is consistent with the view that the deep network is a discretisation of an underlying continuous dynamical system, modelled as an ordinary differential equation (ODE); in this case, the ResNet performs a forward Euler discretisation of the ODE $\dot{\mathbf{x}} = \sigma(\mathbf{F}(\mathbf{x}))$. Crucially, learning a dynamical system by modelling its velocity turns out to be much easier than learning its position directly. In our learning setup, this translates into an optimisation landscape with more favourable geometry, leading to the ability to train much deeper architectures than was possible before. As will be discussed in future work, learning using deep neural networks defines a non-convex optimisation problem, which can be efficiently solved using gradient-descent methods under certain simplifying regimes. The key advantage of the ResNet parametrisation has been rigorously analysed in simple scenarios (Hardt and Ma, 2016), and remains an active area of theoretical investigation. Finally, Neural ODEs (Chen et al., 2018) are a recent popular architecture that pushes the analogy with ODEs even further, by learning the parameters of the ODE $\dot{\mathbf{x}} = \sigma(\mathbf{F}(\mathbf{x}))$ directly and relying on standard numerical integration.
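The forward-Euler reading can be made concrete in a few lines. The sketch below is our own toy, with a dense matrix standing in for the equivariant map $\mathbf{F}$ and no coarsening $\mathbf{P}$:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_step(x, W, dt=1.0):
    """h = x + dt * sigma(W x): the ResNet update, read as one forward-Euler
    step of the ODE dx/dt = sigma(W x) with step size dt (dt = 1 in ResNets)."""
    return x + dt * relu(W @ x)

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))   # toy stand-in for the learned layer F
x = rng.normal(size=4)

# Composing K steps of size 1/K traces out the ODE flow; the dt -> 0 limit
# is the Neural ODE view mentioned in the text.
h = x.copy()
for _ in range(4):
    h = residual_step(h, W, dt=0.25)
```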

Normalisation

Another important algorithmic innovation that significantly boosted the empirical performance of CNNs is the notion of normalisation. In early models of neural activity, it was hypothesised that neurons perform some form of local 'gain control', where the layer coefficients $\mathbf{x}_k$ are replaced by $\tilde{\mathbf{x}}_k = \sigma_k^{-1}\odot(\mathbf{x}_k - \mu_k)$. Here, $\mu_k$ and $\sigma_k$ encode the first- and second-order moment information of $\mathbf{x}_k$, respectively, and can be computed either globally or locally.

In the context of deep learning, this principle was widely adopted through the batch normalisation layer (Ioffe and Szegedy, 2015), followed by several variants (Ba et al., 2016; Salimans and Kingma, 2016; Ulyanov et al., 2016; Cooijmans et al., 2016; Wu and He, 2018). (We note that normalising the activations of neural networks had seen attention even before the advent of batch normalisation; see, e.g., Lyu and Simoncelli (2008).) Despite some attempts to rigorously explain the benefits of normalisation in terms of better-conditioned optimisation landscapes (Santurkar et al., 2018), a general theory that can provide guiding principles is still missing at the time of writing.
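A minimal sketch of the gain-control principle as realised by batch normalisation (per-channel statistics over a batch; `batch_norm` is our own simplified version, omitting the running statistics used at inference time):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x_tilde = (x - mu) / sigma per channel, followed by a learnable
    affine rescaling (gamma, beta); x has shape (batch, channels)."""
    mu = x.mean(axis=0)
    sigma = np.sqrt(x.var(axis=0) + eps)
    return gamma * (x - mu) / sigma + beta

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 8))   # activations far from zero mean
y = batch_norm(x)                           # ~zero mean, ~unit variance per channel
```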

Data augmentation

While CNNs encode the geometric priors associated with translation invariance and scale separation, they do not explicitly account for other known transformations that preserve semantic information, e.g. lighting or colour changes, or small rotations and dilations. A pragmatic approach to incorporating these priors with minimal architectural changes is data augmentation, where one manually applies said transformations to the input images and adds the results to the training set.

Data augmentation has been empirically successful and is widely used—not only to train state-of-the-art vision architectures, but also to prop up several developments in self-supervised and causal representation learning (Chen et al., 2020; Grill et al., 2020; Mitrovic et al., 2020). However, it is provably sub-optimal in terms of sample complexity (Mei et al., 2021); a more efficient strategy considers instead architectures with richer invariance groups—as we discuss next.
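The augmentation idea above can be sketched in a few lines, applying label-preserving transformations to each image (here: horizontal flip, small cyclic shift, and brightness jitter; the particular transformations and ranges are our own task-dependent choices):

```python
import numpy as np

def augment(x, rng):
    """Apply a random transformation assumed to preserve the label."""
    if rng.random() < 0.5:
        x = x[:, ::-1]                         # horizontal flip
    x = np.roll(x, rng.integers(-2, 3), axis=1)  # small cyclic translation
    return x * rng.uniform(0.9, 1.1)           # brightness jitter

rng = np.random.default_rng(0)
img = np.arange(64.0).reshape(8, 8)
batch = np.stack([augment(img, rng) for _ in range(4)])  # augmented copies
```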

5.2 Group-equivariant CNNs

As discussed in Section 4.3, we can generalise the convolution operation from signals on a Euclidean space to signals on any homogeneous space $\Omega$ acted upon by a group $\mathfrak{G}$ (recall that a homogeneous space is a set $\Omega$ equipped with a transitive group action, meaning that for any $u, v\in\Omega$ there exists $\mathfrak{g}\in\mathfrak{G}$ such that $\mathfrak{g}.u = v$). By analogy to the Euclidean convolution, where a translated filter is matched with the signal, the idea of group convolution is to move the filter around the domain using the group action, e.g. by rotating and translating it. By virtue of the transitivity of the group action, we can move the filter to any position on $\Omega$. In this section, we discuss several concrete examples of the general idea of group convolution, including implementation aspects and architectural choices.

Discrete group convolution

We begin by considering the case where the domain $\Omega$ as well as the group $\mathfrak{G}$ are discrete. As our first example, we consider medical volumetric images, represented as signals on 3D grids with discrete translation and rotation symmetries. The domain is the 3D cubical grid $\Omega = \mathbb{Z}^3$ and the images (e.g. MRI or CT 3D scans) are modelled as functions $x: \mathbb{Z}^3\to\mathbb{R}$, i.e. $x\in\mathcal{X}(\Omega)$. Although in practice such images have support on a finite cuboid $[W]\times[H]\times[D]\subset\mathbb{Z}^3$, we instead prefer to view them as functions on $\mathbb{Z}^3$ with appropriate zero padding. As our symmetry, we consider the group $\mathfrak{G} = \mathbb{Z}^3\rtimes O_h$ of distance- and orientation-preserving transformations of $\mathbb{Z}^3$. This group consists of translations ($\mathbb{Z}^3$) and the discrete rotations $O_h$ generated by $90$-degree rotations about the three axes (see Figure 16).

Figure 16: A $3\times 3$ filter, rotated by all $24$ elements of the discrete rotation group $O_h$, generated by $90$-degree rotations about the vertical axis (red arrows) and $120$-degree rotations about a diagonal axis (blue arrows).

As our second example, we consider DNA sequences made up of four letters: C, G, A, and T. (DNA is a biopolymer molecule made of four repeating units called nucleotides (Cytosine, Guanine, Adenine, and Thymine), arranged into two strands coiled around each other in a double helix, where each nucleotide occurs opposite its complementary one (base pairs A/T and C/G).) The sequences can be represented on the 1D grid $\Omega = \mathbb{Z}$ as signals $x: \mathbb{Z}\to\mathbb{R}^4$, where each letter is one-hot coded in $\mathbb{R}^4$. Naturally, we have a discrete 1D translation symmetry on the grid, but DNA sequences have an additional interesting symmetry, arising from the way DNA is physically embodied as a double helix and the way it is read by the molecular machinery of the cell. Each strand of the double helix begins with what is called the $5'$-end and ends with a $3'$-end, with the $5'$ on one strand complemented by a $3'$ on the other strand; in other words, the two strands have opposite orientations. Since a DNA molecule is always read off starting at the $5'$-end, but we do not know which one, a sequence such as ACCCTGG is equivalent to the reversed sequence with each letter replaced by its complement, CCAGGGT. This is called the reverse-complement symmetry of the letter sequence. We thus have the two-element group $\mathbb{Z}_2 = \{0, 1\}$, with $0$ the identity and $1$ the reverse-complement transformation (and composition $1 + 1 = 0 \bmod 2$). The full group combines translations and reverse-complement transformations.

In our case, the group convolution (14) we defined in Section 4.3 is given as

$$(x\star\theta)(\mathfrak{g}) = \sum_{u\in\Omega} x_u\, \rho(\mathfrak{g})\theta_u, \qquad (31)$$

the inner product between the (single-channel) input signal $x$ and a filter $\theta$ transformed by $\mathfrak{g}\in\mathfrak{G}$ via $\rho(\mathfrak{g})\theta_u = \theta_{\mathfrak{g}^{-1}u}$; the output $x\star\theta$ is a function on $\mathfrak{G}$. Note that since $\Omega$ is discrete, we have replaced the integral of equation (14) by a sum.

Transform+Convolve approach

We will show that the group convolution can be implemented in two steps: a filter transformation step and a translational convolution step. The filter transformation step consists of creating rotated (or reverse-complement transformed) copies of a basic filter, while the translational convolution is the same as in standard CNNs and is thus efficiently computable on hardware such as GPUs. To see this, note that in both of our examples we can write a general transformation $\mathfrak{g}\in\mathfrak{G}$ as a transformation $\mathfrak{h}\in\mathfrak{H}$ (e.g. a rotation or reverse-complement transformation) followed by a translation $\mathfrak{k}\in\mathbb{Z}^d$, i.e. $\mathfrak{g} = \mathfrak{k}\mathfrak{h}$ (with juxtaposition denoting composition of the group elements $\mathfrak{k}$ and $\mathfrak{h}$). By properties of the group representation, we have $\rho(\mathfrak{g}) = \rho(\mathfrak{k}\mathfrak{h}) = \rho(\mathfrak{k})\rho(\mathfrak{h})$. Thus,

$$(x\star\theta)(\mathfrak{k}\mathfrak{h}) = \sum_{u\in\Omega} x_u\, \rho(\mathfrak{k})\rho(\mathfrak{h})\theta_u = \sum_{u\in\Omega} x_u\, (\rho(\mathfrak{h})\theta)_{u-\mathfrak{k}}. \qquad (32)$$

We recognise the last expression as the standard (planar Euclidean) convolution of the signal $x$ with the transformed filter $\rho(\mathfrak{h})\theta$. Thus, to implement the group convolution for these groups, we take the canonical filter $\theta$, create transformed copies $\theta_{\mathfrak{h}} = \rho(\mathfrak{h})\theta$ for each $\mathfrak{h}\in\mathfrak{H}$ (e.g. each rotation $\mathfrak{h}\in O_h$ or reverse-complement DNA symmetry $\mathfrak{h}\in\mathbb{Z}_2$), and then convolve $x$ with each of these filters: $(x\star\theta)(\mathfrak{k}\mathfrak{h}) = (x\star\theta_{\mathfrak{h}})(\mathfrak{k})$. For both of our examples, the symmetries act on filters by simply permuting the filter coefficients, as shown in Figure 16 for discrete rotations. Hence, these operations can be implemented efficiently using an indexing operation with pre-computed indices.
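The transform+convolve recipe can be sketched in a few lines. For ease of visualisation we use the planar analogue of the first example: 2D translations together with the four $90$-degree rotations (the group p4), rather than the 3D group $O_h$ of the text; `group_conv` and `conv2d` are our own illustrative names.

```python
import numpy as np

def conv2d(x, f):
    """Plain translational ('valid') correlation, as in a standard CNN."""
    Hf, Wf = f.shape
    out = np.zeros((x.shape[0] - Hf + 1, x.shape[1] - Wf + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            out[u, v] = np.sum(f * x[u:u + Hf, v:v + Wf])
    return out

def group_conv(x, theta):
    """Step 1: build transformed copies rho(h) theta (np.rot90 is a pure
    permutation of filter coefficients). Step 2: one translational convolution
    per copy, yielding a stack of orientation channels indexed by h."""
    return np.stack([conv2d(x, np.rot90(theta, k)) for k in range(4)])

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
theta = rng.normal(size=(3, 3))
feats = group_conv(x, theta)   # shape (4, 4, 4): |H| orientation channels
```

Rotating the input rotates each orientation channel and cyclically permutes the stack, which is precisely the equivariance behaviour of orientation channels described in the text.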

While we defined the feature maps output by the group convolution $x\star\theta$ as functions on $\mathfrak{G}$, the fact that we can split $\mathfrak{g}$ into $\mathfrak{h}$ and $\mathfrak{k}$ means that we can also think of them as a stack of Euclidean feature maps (sometimes called orientation channels), with one feature map per filter transformation / orientation $\mathfrak{h}$. For instance, in our first example we would associate to each filter rotation (each node in Figure 16) a feature map, obtained by convolving (in the traditional translational sense) the rotated filter with the input. These feature maps can thus still be stored as a $W\times H\times C$ array, where the number of channels $C$ equals the number of independent filters times the number of transformations $\mathfrak{h}\in\mathfrak{H}$ (e.g. rotations).

As shown in Section 4.3, the group convolution is equivariant: $(\rho(\mathfrak{g})x)\star\theta = \rho(\mathfrak{g})(x\star\theta)$. In terms of orientation channels, this means that under the action of $\mathfrak{h}$, each orientation channel is transformed and the orientation channels themselves are permuted. For instance, if we associate one orientation channel to each transformation in Figure 16 and apply a rotation by $90$ degrees about the z-axis (corresponding to the red arrows), the feature maps will be permuted as shown by the red arrows. This description makes it clear that a group convolutional neural network bears much similarity to a traditional CNN; hence, many of the network design patterns discussed in Section 5.1, such as residual networks, can be used with group convolutions as well.

Spherical CNNs in the Fourier domain

For the continuous symmetry group of the sphere that we saw in Section 4.3, it is possible to implement the convolution in the spectral domain, using the appropriate Fourier transform (we remind the reader that the convolution of signals on $\mathbb{S}^2$ is a function on $\mathrm{SO}(3)$, hence we need to define the Fourier transform on both these domains in order to implement multi-layer spherical CNNs). Spherical harmonics form an orthogonal basis on the 2D sphere, analogous to the classical Fourier basis of complex exponentials; on the special orthogonal group, the corresponding basis is given by the Wigner D-functions. In both cases, the Fourier coefficients are computed as inner products with the basis functions, and an analogue of the Convolution Theorem holds: one can compute the convolution in the Fourier domain as the element-wise product of the Fourier transforms. Furthermore, FFT-like algorithms exist for the efficient computation of the Fourier transform on $\mathbb{S}^2$ and $\mathrm{SO}(3)$. We refer to Cohen et al. (2018) for further details.
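The spectral mechanism is the same as the classical Convolution Theorem; the sketch below verifies it on the circle $\mathbb{Z}_n$ (a 1D stand-in we chose in place of $\mathbb{S}^2$ and $\mathrm{SO}(3)$, where the basis is the ordinary DFT rather than spherical harmonics or Wigner D-functions):

```python
import numpy as np

n = 32
rng = np.random.default_rng(0)
x = rng.normal(size=n)
theta = rng.normal(size=n)

# Direct cyclic convolution: (x * theta)(u) = sum_v x(v) theta((u - v) mod n)
direct = np.array([np.sum(x * theta[(u - np.arange(n)) % n]) for u in range(n)])

# Spectral route: element-wise product of Fourier transforms, then inverse FFT
spectral = np.fft.ifft(np.fft.fft(x) * np.fft.fft(theta)).real

assert np.allclose(direct, spectral)
```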

5.3 Graph Neural Networks

Graph Neural Networks (GNNs) are the realisation of our Geometric Deep Learning blueprint on graphs, leveraging the properties of the permutation group. GNNs are among the most general classes of deep learning architectures currently in existence, and as we will see in this text, most other deep learning architectures can be understood as a special case of the GNN with additional geometric structure.

As per our discussion in Section 4.1, we consider a graph to be specified with an adjacency matrix $\mathbf{A}$ and node features $\mathbf{X}$. We will study GNN architectures that are permutation equivariant functions $\mathbf{F}(\mathbf{X},\mathbf{A})$ constructed by applying shared permutation invariant functions $\phi(\mathbf{x}_u,\mathbf{X}_{\mathcal{N}_u})$ over local neighbourhoods. Under various guises, this local function $\phi$ can be referred to as "diffusion", "propagation", or "message passing", and the overall computation of such $\mathbf{F}$ as a "GNN layer".

The design and study of GNN layers is one of the most active areas of deep learning at the time of writing, making it a landscape that is challenging to navigate. Fortunately, we find that the vast majority of the literature may be derived from only three "flavours" of GNN layers (Figure 17), which we will present here. These flavours govern the extent to which $\phi$ transforms the neighbourhood features, allowing for varying degrees of complexity when modelling interactions across the graph.

In all three flavours, permutation invariance is ensured by aggregating features from $\mathbf{X}_{\mathcal{N}_u}$ (potentially transformed, by means of some function $\psi$) with some permutation-invariant function $\bigoplus$, and then updating the features of node $u$, by means of some function $\phi$. Typically, $\psi$ and $\phi$ are learnable, whereas $\bigoplus$ is realised as a nonparametric operation such as sum, mean, or maximum, though it can also be constructed e.g. using recurrent neural networks (Murphy et al., 2018). Most commonly, $\psi$ and $\phi$ are learnable affine transformations with activation functions, e.g. $\psi(\mathbf{x})=\mathbf{W}\mathbf{x}+\mathbf{b}$ and $\phi(\mathbf{x},\mathbf{z})=\sigma\left(\mathbf{W}\mathbf{x}+\mathbf{U}\mathbf{z}+\mathbf{b}\right)$, where $\mathbf{W},\mathbf{U},\mathbf{b}$ are learnable parameters and $\sigma$ is an activation function such as the rectified linear unit. The additional input of $\mathbf{x}_u$ to $\phi$ represents an optional skip-connection, which is often very useful.

Figure 17: A visualisation of the dataflow for the three flavours of GNN layers, $g$. We use the neighbourhood of node $b$ from Figure 10 to illustrate this. Left-to-right: convolutional, where sender node features are multiplied with a constant, $c_{uv}$; attentional, where this multiplier is implicitly computed via an attention mechanism of the receiver over the sender: $\alpha_{uv}=a(\mathbf{x}_u,\mathbf{x}_v)$; and message-passing, where vector-based messages are computed based on both the sender and receiver: $\mathbf{m}_{uv}=\psi(\mathbf{x}_u,\mathbf{x}_v)$.

In the convolutional flavour (Kipf and Welling, 2016a; Defferrard et al., 2016; Wu et al., 2019), the features of the neighbourhood nodes are directly aggregated with fixed weights,

$$\mathbf{h}_u=\phi\left(\mathbf{x}_u,\bigoplus_{v\in\mathcal{N}_u}c_{uv}\,\psi(\mathbf{x}_v)\right). \qquad (33)$$

Here, $c_{uv}$ specifies the importance of node $v$ to node $u$'s representation. It is a constant that often directly depends on the entries of $\mathbf{A}$ representing the structure of the graph. Note that when the aggregation operator $\bigoplus$ is chosen to be the summation, the aggregation can be considered a linear diffusion or position-dependent linear filtering, a generalisation of convolution. (It is worth noting that this flavour does not express every GNN layer that is convolutional in the sense of commuting with the graph structure, but it covers most such approaches proposed in practice; we will provide detailed discussion and extensions in future work.) In particular, the spectral filters we have seen in Sections 4.4 and 4.6 fall under this category, as they amount to applying fixed local operators (e.g. the Laplacian matrix) to node-wise signals.
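
A minimal NumPy sketch of Eq. (33) may help fix ideas. The concrete choices below are illustrative, not prescribed by the text: sum aggregation, $\psi(\mathbf{x})=\mathbf{W}\mathbf{x}$, the symmetric normalisation $c_{uv}=1/\sqrt{d_u d_v}$ (one common choice, used e.g. by Kipf and Welling), and a ReLU update with a skip-connection:

```python
import numpy as np

def conv_gnn_layer(X, A, W_self, W_neigh):
    # Sketch of Eq. (33): fixed coefficients c_uv = 1/sqrt(d_u d_v),
    # psi(x) = W_neigh x, and phi(x, z) = relu(W_self x + z).
    deg = A.sum(axis=1)
    C = A / np.sqrt(np.outer(deg, deg))      # c_uv, zero where no edge
    Z = C @ (X @ W_neigh.T)                  # aggregate transformed neighbours
    return np.maximum(X @ W_self.T + Z, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])                 # a toy 3-cycle
X = rng.standard_normal((3, 2))
W1, W2 = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
H = conv_gnn_layer(X, A, W1, W2)

# Permutation equivariance: relabelling the nodes relabels the outputs.
P = np.eye(3)[[2, 0, 1]]
assert np.allclose(conv_gnn_layer(P @ X, P @ A @ P.T, W1, W2), P @ H)
```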

In the attentional flavour (Veličković et al., 2018; Monti et al., 2017; Zhang et al., 2018), the interactions are implicit:

$$\mathbf{h}_u=\phi\left(\mathbf{x}_u,\bigoplus_{v\in\mathcal{N}_u}a(\mathbf{x}_u,\mathbf{x}_v)\,\psi(\mathbf{x}_v)\right). \qquad (34)$$

Here, $a$ is a learnable self-attention mechanism that computes the importance coefficients $\alpha_{uv}=a(\mathbf{x}_u,\mathbf{x}_v)$ implicitly. They are often softmax-normalised across all neighbours. When $\bigoplus$ is the summation, the aggregation is still a linear combination of the neighbourhood node features, but now the weights are feature-dependent.
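
Continuing the sketch, Eq. (34) can be instantiated with one hypothetical choice of attention mechanism, a dot-product score softmax-normalised over each neighbourhood (real attentional GNNs such as GAT use more elaborate, parameterised forms of $a$):

```python
import numpy as np

def attn_gnn_layer(X, A, W):
    # Sketch of Eq. (34): psi(x) = W x, with a dot-product attention
    # mechanism a(x_u, x_v), softmax-normalised over N_u.
    Z = X @ W.T
    scores = np.where(A > 0, Z @ Z.T, -np.inf)   # mask non-edges
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over neighbours
    return alpha @ Z                             # sum_v alpha_uv psi(x_v)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
X = rng.standard_normal((3, 2))
W = rng.standard_normal((4, 2))
H = attn_gnn_layer(X, A, W)

# The layer remains permutation equivariant: weights are feature-dependent,
# so relabelling the nodes simply relabels the outputs.
P = np.eye(3)[[1, 2, 0]]
assert np.allclose(attn_gnn_layer(P @ X, P @ A @ P.T, W), P @ H)
```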

Finally, the message-passing flavour (Gilmer et al., 2017; Battaglia et al., 2018) amounts to computing arbitrary vectors (“messages”) across edges,

$$\mathbf{h}_u=\phi\left(\mathbf{x}_u,\bigoplus_{v\in\mathcal{N}_u}\psi(\mathbf{x}_u,\mathbf{x}_v)\right). \qquad (35)$$

Here, $\psi$ is a learnable message function computing the vector that node $v$ sends to node $u$, and the aggregation can be considered a form of message passing on the graph.
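
The general form of Eq. (35) can be sketched with arbitrary callables for $\psi$ and $\phi$; the concrete message and update functions below are toy stand-ins chosen only to make the computation easy to follow:

```python
import numpy as np

def mpnn_layer(X, A, psi, phi):
    # Sketch of Eq. (35): an arbitrary message function psi over each
    # edge, sum aggregation over the neighbourhood, update phi.
    n = len(X)
    out = []
    for u in range(n):
        messages = [psi(X[u], X[v]) for v in range(n) if A[u, v] > 0]
        out.append(phi(X[u], np.sum(messages, axis=0)))
    return np.array(out)

# Toy example: two connected nodes. Messages add both endpoints' features;
# the update concatenates the aggregated message to the node's own features.
A = np.array([[0., 1.], [1., 0.]])
X = np.array([[1., 2.], [3., 4.]])
psi = lambda xu, xv: xu + xv
phi = lambda xu, m: np.concatenate([xu, m])
H = mpnn_layer(X, A, psi, phi)
assert np.allclose(H, [[1., 2., 4., 6.], [3., 4., 4., 6.]])
```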

One important thing to note is a representational containment between these approaches: convolution $\subseteq$ attention $\subseteq$ message-passing. Indeed, attentional GNNs can represent convolutional GNNs by an attention mechanism implemented as a look-up table, $a(\mathbf{x}_u,\mathbf{x}_v)=c_{uv}$, and both convolutional and attentional GNNs are special cases of message-passing where the messages depend only on the sender nodes' features: $\psi(\mathbf{x}_u,\mathbf{x}_v)=c_{uv}\,\psi(\mathbf{x}_v)$ for convolutional GNNs and $\psi(\mathbf{x}_u,\mathbf{x}_v)=a(\mathbf{x}_u,\mathbf{x}_v)\,\psi(\mathbf{x}_v)$ for attentional GNNs.

This does not imply that message-passing GNNs are always the most useful variant: as they have to compute vector-valued messages across edges, they are typically harder to train and require unwieldy amounts of memory. Further, on a wide range of naturally-occurring graphs, the graph's edges encode downstream class similarity (i.e. an edge $(u,v)$ implies that $u$ and $v$ are likely to have the same output). For such graphs (often called homophilous), convolutional aggregation across neighbourhoods is often a far better choice, both in terms of regularisation and scalability. Attentional GNNs offer a middle ground: they allow for modelling complex interactions within neighbourhoods while computing only scalar-valued quantities across the edges, making them more scalable than message-passing.

The “three flavour” categorisation presented here is provided with brevity in mind and inevitably neglects a wealth of nuances, insights, generalisations, and historical context of GNN models. Importantly, it excludes higher-dimensional GNNs based on the Weisfeiler-Lehman hierarchy, as well as spectral GNNs relying on the explicit computation of the graph Fourier transform.

5.4 Deep Sets, Transformers, and Latent Graph Inference

We close the discussion on GNNs by remarking on permutation-equivariant neural network architectures for learning representations of unordered sets. While sets have the least structure among the domains we have discussed in this text, their importance has been recently highlighted by highly popular architectures such as Transformers (Vaswani et al., 2017) and Deep Sets (Zaheer et al., 2017). In the language of Section 4.1, we assume that we are given a matrix of node features, $\mathbf{X}$, but without any specified adjacency or ordering information between the nodes. The specific architectures will arise by deciding to what extent to model interactions between the nodes.

Empty edge set

Unordered sets are provided without any additional structure or geometry whatsoever—hence, it could be argued that the most natural way to process them is to treat each set element entirely independently. This translates to a permutation equivariant function over such inputs, which was already introduced in Section 4.1: a shared transformation applied to every node in isolation. Assuming the same notation as when describing GNNs (Section 5.3), such models can be represented as

$$\mathbf{h}_u=\psi(\mathbf{x}_u),$$

where $\psi$ is a learnable transformation. It may be observed that this is a special case of a convolutional GNN with $\mathcal{N}_u=\{u\}$—or, equivalently, $\mathbf{A}=\mathbf{I}$. Such an architecture is commonly referred to as Deep Sets, in recognition of the work of Zaheer et al. (2017), who theoretically proved several universal-approximation properties of such architectures. It should be noted that the need to process unordered sets commonly arises in computer vision and graphics when dealing with point clouds; therein, such models are known as PointNets (Qi et al., 2017).
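
The construction is short enough to sketch directly; the particular $\psi$ and the mean readout below are illustrative choices, and the sum pooling stands in for the invariant readout typically used when a single set-level output is desired:

```python
import numpy as np

def deep_sets(X, psi, phi):
    # Deep Sets sketch: a shared psi applied to each element in
    # isolation (the A = I case), plus an invariant pooled readout.
    H = np.stack([psi(x) for x in X])   # per-element, equivariant part
    return H, phi(H.sum(axis=0))        # pooled, invariant part

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
psi = lambda x: x ** 2
phi = lambda z: z / len(X)
H, pooled = deep_sets(X, psi, phi)

# Shuffling the set permutes H but leaves the pooled output unchanged.
Hs, pooled_s = deep_sets(X[[2, 0, 1]], psi, phi)
assert np.allclose(pooled, pooled_s)
assert np.allclose(Hs, H[[2, 0, 1]])
```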

Complete edge set

While assuming an empty edge set is a very efficient construct for building functions over unordered sets, often we would expect that elements of the set exhibit some form of relational structure—i.e., that there exists a latent graph between the nodes. Setting $\mathbf{A}=\mathbf{I}$ discards any such structure, and may yield suboptimal performance. Conversely, we could assume that, in the absence of any other prior knowledge, we cannot exclude upfront any possible links between nodes. In this approach we assume the complete graph, $\mathbf{A}=\mathbf{1}\mathbf{1}^\top$; equivalently, $\mathcal{N}_u=\mathcal{V}$. As we do not assume access to any coefficients of interaction, running convolutional-type GNNs over such a graph would amount to:

$$\mathbf{h}_u=\phi\left(\mathbf{x}_u,\bigoplus_{v\in\mathcal{V}}\psi(\mathbf{x}_v)\right),$$

where the second input, $\bigoplus_{v\in\mathcal{V}}\psi(\mathbf{x}_v)$, is identical for all nodes $u$ (a direct consequence of the permutation invariance of $\bigoplus$), and as such makes the model equivalently expressive to ignoring that input altogether; i.e. the $\mathbf{A}=\mathbf{I}$ case mentioned above.

This motivates the use of a more expressive GNN flavour, the attentional,

$$\mathbf{h}_u=\phi\left(\mathbf{x}_u,\bigoplus_{v\in\mathcal{V}}a(\mathbf{x}_u,\mathbf{x}_v)\,\psi(\mathbf{x}_v)\right) \qquad (36)$$

which yields the self-attention operator, the core of the Transformer architecture (Vaswani et al., 2017). Assuming some kind of normalisation over the attentional coefficients (e.g. softmax), we can constrain all the scalars $a(\mathbf{x}_u,\mathbf{x}_v)$ to be in the range $[0,1]$; as such, we can think of self-attention as inferring a soft adjacency matrix, $a_{uv}=a(\mathbf{x}_u,\mathbf{x}_v)$, as a byproduct of gradient-based optimisation for some downstream task.
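
Scaled dot-product self-attention, the standard instantiation of this operator, can be read directly as Eq. (36) over the complete graph: $a(\mathbf{x}_u,\mathbf{x}_v)$ is a softmax-normalised query-key score and $\psi(\mathbf{x}_v)$ is a learned value projection. A single-head sketch (toy dimensions, no positional encodings):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Eq. (36) over the complete graph: soft adjacency alpha[u, v],
    # psi(x_v) = Wv x_v.
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    scores = Q @ K.T / np.sqrt(K.shape[1])
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # rows of the soft adjacency
    return alpha @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv = (rng.standard_normal((4, 8)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)

# Without positional encodings, the operator is permutation equivariant,
# as expected for a GNN over the complete graph.
P = np.eye(5)[[4, 2, 0, 1, 3]]
assert np.allclose(self_attention(P @ X, Wq, Wk, Wv), P @ H)
```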

The above perspective means that we can pose Transformers exactly as attentional GNNs over a complete graph (Joshi, 2020). (It is also appropriate to apply the message-passing flavour over the complete graph. While popular for physics simulations and relational reasoning (e.g. Battaglia et al. (2016); Santoro et al. (2017)), such models have not been as widely used as Transformers. This is likely due to the memory issues associated with computing vector messages over a complete graph, or the fact that vector-based messages are less interpretable than the "soft adjacency" provided by self-attention.) However, this is in apparent conflict with Transformers being initially proposed for modelling sequences—the representations $\mathbf{h}_u$ should be mindful of node $u$'s position in the sequence, which complete-graph aggregation would ignore. Transformers address this issue by introducing positional encodings: the node features $\mathbf{x}_u$ are augmented to encode node $u$'s position in the sequence, typically as samples from a sine wave whose frequency is dependent on $u$.

On graphs, where no natural ordering of nodes exists, multiple alternatives to such positional encodings have been proposed. While we defer a discussion of these alternatives to later, we note that one promising direction involves the realisation that the positional encodings used in Transformers can be directly related to the discrete Fourier transform (DFT), and hence to the eigenvectors of the graph Laplacian of a "circular grid". Hence, Transformers' positional encodings implicitly represent the assumption that input nodes are connected in a grid. For more general graph structures, one may simply use the Laplacian eigenvectors of the (assumed) graph—an observation exploited by Dwivedi and Bresson (2020) within their empirically powerful Graph Transformer model.
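
The DFT connection is easy to verify numerically: the Fourier modes $\cos(2\pi k u/n)$ and $\sin(2\pi k u/n)$ are eigenvectors of the Laplacian of a cycle graph (the "circular grid"), with eigenvalue $2-2\cos(2\pi k/n)$. A short check:

```python
import numpy as np

n = 8
# Adjacency and Laplacian of a cycle graph on n nodes
A = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
L = 2 * np.eye(n) - A

# Each real DFT mode is an eigenvector of L: for x[u] = cos(2 pi k u / n),
# (L x)[u] = 2 x[u] - x[u-1] - x[u+1] = (2 - 2 cos(2 pi k / n)) x[u].
u = np.arange(n)
for k in range(n):
    mode = np.cos(2 * np.pi * k * u / n)
    lam = 2 - 2 * np.cos(2 * np.pi * k / n)
    assert np.allclose(L @ mode, lam * mode)
```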

Inferred edge set

Finally, one can try to learn the latent relational structure, leading to some general $\mathbf{A}$ that is neither $\mathbf{I}$ nor $\mathbf{1}\mathbf{1}^\top$. The problem of inferring a latent adjacency matrix $\mathbf{A}$ for a GNN to use (often called latent graph inference) is of high interest for graph representation learning. This is due to the fact that assuming $\mathbf{A}=\mathbf{I}$ may be representationally inferior, while $\mathbf{A}=\mathbf{1}\mathbf{1}^\top$ may be challenging to implement due to memory requirements and large neighbourhoods to aggregate over. Additionally, it is closest to the "true" problem: inferring an adjacency matrix $\mathbf{A}$ implies detecting useful structure between the rows of $\mathbf{X}$, which may then help formulate hypotheses such as causal relations between variables.

Unfortunately, such a framing necessarily induces a step-up in modelling complexity. Specifically, it requires properly balancing a structure learning objective (which is discrete, and hence challenging for gradient-based optimisation) with any downstream task the graph is used for. This makes latent graph inference a highly challenging and intricate problem.

5.5 Equivariant Message Passing Networks

In many applications of Graph Neural Networks, node features (or parts thereof) are not just arbitrary vectors but coordinates of geometric entities. This is the case, for example, when dealing with molecular graphs: the nodes representing atoms may contain information about the atom type as well as its 3D spatial coordinates. It is desirable to process the latter part of the features in a manner that would transform in the same way as the molecule is transformed in space, in other words, be equivariant to the Euclidean group $\mathrm{E}(3)$ of rigid motions (rotations, translations, and reflections) in addition to the standard permutation equivariance discussed before.

To set the stage for our (slightly simplified) analysis, we will make a distinction between node features $\mathbf{f}_u\in\mathbb{R}^d$ and node spatial coordinates $\mathbf{x}_u\in\mathbb{R}^3$; the latter are endowed with Euclidean symmetry structure. In this setting, an equivariant layer explicitly transforms these two inputs separately, yielding modified node features $\mathbf{f}'_u$ and coordinates $\mathbf{x}'_u$.

We can now state our desirable equivariance property, following the Geometric Deep Learning blueprint. If the spatial component of the input is transformed by $\mathfrak{g}\in\mathrm{E}(3)$ (represented as $\rho(\mathfrak{g})\mathbf{x}=\mathbf{R}\mathbf{x}+\mathbf{b}$, where $\mathbf{R}$ is an orthogonal matrix modelling rotations and reflections, and $\mathbf{b}$ is a translation vector), the spatial component of the output transforms in the same way (as $\mathbf{x}'_u\mapsto\mathbf{R}\mathbf{x}'_u+\mathbf{b}$), whereas $\mathbf{f}'_u$ remains invariant.

Much like the space of permutation equivariant functions we discussed before in the context of general graphs, there exists a vast number of $\mathrm{E}(3)$-equivariant layers that would satisfy the constraints above—but not all of these layers would be geometrically stable, or easy to implement. In fact, the space of practically useful equivariant layers may well admit a simple categorisation, not unlike our "three flavours" of spatial GNN layers. One elegant solution was suggested by Satorras et al. (2021) in the form of equivariant message passing. Their model operates as follows:

$$\mathbf{f}'_u = \phi\left(\mathbf{f}_u,\bigoplus_{v\in\mathcal{N}_u}\psi_{\mathrm{f}}\left(\mathbf{f}_u,\mathbf{f}_v,\|\mathbf{x}_u-\mathbf{x}_v\|^2\right)\right),$$
$$\mathbf{x}'_u = \mathbf{x}_u+\sum_{v\neq u}(\mathbf{x}_u-\mathbf{x}_v)\,\psi_{\mathrm{c}}\left(\mathbf{f}_u,\mathbf{f}_v,\|\mathbf{x}_u-\mathbf{x}_v\|^2\right),$$

where $\psi_{\mathrm{f}}$ and $\psi_{\mathrm{c}}$ are two distinct (learnable) functions. It can be shown that such an aggregation is equivariant under Euclidean transformations of the spatial coordinates. This is due to the fact that the only dependence of $\mathbf{f}'_u$ on $\mathbf{x}_u$ is through the distances $\|\mathbf{x}_u-\mathbf{x}_v\|^2$, and the action of $\mathrm{E}(3)$ necessarily leaves distances between nodes unchanged. Further, the computations of such a layer can be seen as a particular instance of the "message-passing" GNN flavour, hence they are efficient to implement.
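
The equivariance argument can be checked numerically. The sketch below follows the two update equations with sum aggregation over all other nodes; the concrete $\psi_{\mathrm{f}}$, $\psi_{\mathrm{c}}$, and $\phi$ are hypothetical stand-ins for the learnable functions, chosen only so that the computation runs:

```python
import numpy as np

def egnn_layer(F, Xc, psi_f, psi_c, phi):
    # Sketch of the equivariant layer of Satorras et al. (2021), with a
    # complete neighbourhood (all v != u) and sum aggregation.
    n = len(F)
    F_out, X_out = [], []
    for u in range(n):
        msgs, shift = [], np.zeros(Xc.shape[1])
        for v in range(n):
            if v == u:
                continue
            d2 = np.sum((Xc[u] - Xc[v]) ** 2)     # E(3)-invariant input
            msgs.append(psi_f(F[u], F[v], d2))
            shift = shift + (Xc[u] - Xc[v]) * psi_c(F[u], F[v], d2)
        F_out.append(phi(F[u], np.sum(msgs, axis=0)))
        X_out.append(Xc[u] + shift)
    return np.array(F_out), np.array(X_out)

# Toy stand-ins for the learnable functions (psi_c returns a scalar).
psi_f = lambda fu, fv, d2: np.tanh(fu + fv + d2)
psi_c = lambda fu, fv, d2: np.tanh(fu @ fv - d2)
phi = lambda f, m: f + m

rng = np.random.default_rng(0)
F, Xc = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
F1, X1 = egnn_layer(F, Xc, psi_f, psi_c, phi)

# E(3) equivariance: rotating/translating the inputs leaves the features
# invariant and transforms the output coordinates the same way.
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # orthogonal matrix
b = rng.standard_normal(3)
F2, X2 = egnn_layer(F, Xc @ R.T + b, psi_f, psi_c, phi)
assert np.allclose(F2, F1)
assert np.allclose(X2, X1 @ R.T + b)
```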

To summarise, in contrast to ordinary GNNs, Satorras et al. (2021) enable the correct treatment of 'coordinates' for each point in the graph. These are now treated as quantities acted upon by the $\mathrm{E}(3)$ group, which means the network outputs behave correctly under rotations, reflections and translations of the input. The features, $\mathbf{f}_u$, however, are treated in a channel-wise manner and still assumed to be scalars that do not change under these transformations. (While scalar features do not change under rotations, vector features may change direction; the simple $\mathrm{E}(3)$-equivariant GNN given before does not take this into account.) This limits the type of spatial information that can be captured within such a framework. For example, it may be desirable for some features to be encoded as vectors—e.g. point velocities—which should change direction under such transformations. Satorras et al. (2021) partially alleviate this issue by introducing the concept of velocities in one variant of their architecture. Velocities are a 3D vector property of each point which rotates appropriately. However, this is only a small subspace of the general representations that could be learned with an $\mathrm{E}(3)$-equivariant network. In general, node features may encode tensors of arbitrary dimensionality that would still transform according to $\mathrm{E}(3)$ in a well-defined manner.

Hence, while the architecture discussed above already presents an elegant equivariant solution for many practical input representations, in some cases it may be desirable to explore a broader collection of functions that satisfy the equivariance property. Existing methods dealing with such settings can be categorised into two classes: irreducible representations (of which the previously mentioned layer is a simplified instance) and regular representations. We briefly survey them here, leaving detailed discussion to future work.

Irreducible representations

Irreducible representations build on the finding that all elements of the roto-translation group can be brought into an irreducible form: a vector that is rotated by a block diagonal matrix. Crucially, each of those blocks is a Wigner D-matrix (the aforementioned Fourier basis for Spherical CNNs). Approaches under this umbrella map from one set of irreducible representations to another using equivariant kernels. To find the full set of equivariant mappings, one can then directly solve the equivariance constraint over these kernels. The solutions form a linear combination of equivariant basis matrices derived from Clebsch-Gordan matrices and the spherical harmonics.

Early examples of the irreducible representations approach include Tensor Field Networks (Thomas et al., 2018) and 3D Steerable CNNs (Weiler et al., 2018), both convolutional models operating on point clouds. The $\mathrm{SE}(3)$-Transformer of Fuchs et al. (2020) extends this framework to the graph domain, using an attentional layer rather than a convolutional one. Further, while our discussion focused on the special-case solution of Satorras et al. (2021), we note that the motivation for rotation- or translation-equivariant predictions over graphs had historically been explored in other fields, including architectures such as Dynamic Graph CNN (Wang et al., 2019b) for point clouds and efficient message passing models for quantum chemistry, such as SchNet (Schütt et al., 2018) and DimeNet (Klicpera et al., 2020).

Regular representations

While the approach of irreducible representations is attractive, it requires directly reasoning about the underlying group representations, which may be tedious, and is only applicable to groups that are compact. Regular representation approaches are more general, but come with an additional computational burden: for exact equivariance they require storing copies of latent feature embeddings for all group elements. (This approach was, in fact, pioneered by the group convolutional neural networks presented in previous sections.)

One promising approach in this space aims to achieve equivariance to Lie groups—through definitions of exponential and logarithmic maps—with the promise of rapid prototyping across various symmetry groups. While Lie groups are out of scope for this section, we refer the reader to two recent successful instances of this direction: the LieConv of Finzi et al. (2020), and the LieTransformer of Hutchinson et al. (2020).

The approaches covered in this section represent popular ways of processing data on geometric graphs in a way that is explicitly equivariant to the underlying geometry. As discussed in Section 4.6, meshes are a special instance of geometric graphs that can be understood as discretisations of continuous surfaces. We will study mesh-specific equivariant neural networks next.

5.6 Intrinsic Mesh CNNs

Meshes, in particular triangular ones, are the ‘bread and butter’ of computer graphics and perhaps the most common way of modeling 3D objects. The remarkable success of deep learning in general, and of CNNs in computer vision in particular, led to a keen interest in the graphics and geometry processing community around the mid-2010s in constructing similar architectures for mesh data. (Marginal figure: examples of geodesic patches; in order for the resulting patch to be a topological disk, its radius $R$ must be smaller than the injectivity radius.)

Geodesic patches

Most of the architectures for deep learning on meshes implement convolutional filters of the form (21) by discretising or approximating the exponential map and expressing the filter in a coordinate system of the tangent plane. Shooting a geodesic $\gamma:[0,T]\rightarrow\Omega$ from a point $u=\gamma(0)$ to a nearby point $v=\gamma(T)$ defines a local system of geodesic polar coordinates $(r(u,v),\vartheta(u,v))$, where $r$ is the geodesic distance between $u$ and $v$ (the length of the geodesic $\gamma$) and $\vartheta$ is the angle between $\gamma'(0)$ and some local reference direction. This allows us to define a geodesic patch $x(u,r,\vartheta) = x(\exp_u \tilde{\omega}(r,\vartheta))$, where $\tilde{\omega}_u:[0,R]\times[0,2\pi)\rightarrow T_u\Omega$ is the local polar frame.
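To make the construction concrete, the following NumPy sketch (a hypothetical helper, not taken from any of the cited implementations) converts precomputed log-map images of neighbours in the tangent plane into geodesic polar coordinates, relative to an arbitrarily chosen reference direction (the gauge):

```python
import numpy as np

def geodesic_polar_coords(log_uv, ref_dir):
    """Geodesic polar coordinates (r, theta) of neighbours of a point u.

    log_uv : (n, 2) array of log-map images of neighbours v in the
             tangent plane T_u (assumed precomputed, e.g. by unfolding
             geodesics onto the plane).
    ref_dir: angle (radians) of the chosen reference direction; changing
             it shifts every theta, reflecting the gauge ambiguity.
    """
    r = np.linalg.norm(log_uv, axis=1)                       # geodesic distance
    theta = np.arctan2(log_uv[:, 1], log_uv[:, 0]) - ref_dir
    theta = np.mod(theta, 2 * np.pi)                         # wrap to [0, 2*pi)
    return r, theta
```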

On a surface discretised as a mesh, a geodesic is a poly-line that traverses the triangular faces (marginal figure: construction of discrete geodesics on a mesh). Traditionally, geodesics have been computed using the Fast Marching algorithm of Kimmel and Sethian (1998), an efficient numerical approximation of a nonlinear PDE called the eikonal equation, encountered in physical models of wave propagation in a medium. This scheme was adapted by Kokkinos et al. (2012) for the computation of local geodesic patches and later reused by Masci et al. (2015) for the construction of Geodesic CNNs, the first intrinsic CNN-like architectures on meshes.

Isotropic filters

Importantly, in the definition of the geodesic patch we have an ambiguity in the choice of the reference direction and the patch orientation. This is exactly the ambiguity of the choice of the gauge: our local system of coordinates is defined up to an arbitrary rotation (or a shift in the angular coordinate, $x(u,r,\vartheta+\vartheta_0)$), which can be different at every node. Perhaps the most straightforward solution is to use isotropic filters of the form $\theta(r)$ that perform a direction-independent aggregation of the neighbour features,

\[
(x\star\theta)(u) = \int_0^R \int_0^{2\pi} x(u,r,\vartheta)\,\theta(r)\,\mathrm{d}r\,\mathrm{d}\vartheta.
\]

Spectral filters discussed in Sections 4.4–4.6 fall under this category: they are based on the Laplacian operator, which is isotropic. Such an approach, however, discards important directional information, and might fail to extract edge-like features.
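As a minimal illustration, the sketch below (hypothetical, assuming the patch has already been extracted in geodesic polar coordinates) aggregates neighbour features purely by radius, so the result is unaffected by any rotation of the gauge:

```python
import numpy as np

def isotropic_conv(x_patch, r, theta_filter, R):
    """Discretised isotropic filter (x * theta)(u) at a single point u.

    x_patch     : (n,) features x(u, r_v, theta_v) of the patch points.
    r           : (n,) geodesic distances r_v of the patch points.
    theta_filter: (K,) learnable radial profile theta(r) on K radial bins.
    The angular coordinate is never used, so the output is independent
    of the (arbitrary) choice of reference direction.
    """
    K = len(theta_filter)
    bins = np.minimum((r / R * K).astype(int), K - 1)  # hard radial bins
    out = 0.0
    for k in range(K):
        mask = bins == k
        if mask.any():
            out += theta_filter[k] * x_patch[mask].mean()
    return out
```

Because only $r$ enters the computation, permuting the features of points that share a radius (as a gauge rotation would) leaves the output unchanged.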

Fixed gauge

An alternative, to which we have already alluded in Section 4.4, is to fix some gauge. Monti et al. (2017) used the principal curvature directions: while this choice is not intrinsic and may be ambiguous at flat points (where the curvature vanishes) or at points of uniform curvature (such as on a perfect sphere), the authors showed that it is reasonable for dealing with deformable human body shapes, which are approximately piecewise-rigid. Later works, e.g. Melzi et al. (2019), showed reliable intrinsic constructions of gauges on meshes, computed as (intrinsic) gradients of intrinsic functions. While such tangent fields might have singularities (i.e., vanish at some points), the overall procedure is very robust to noise and remeshing.

Angular pooling

Another approach, referred to as angular max pooling, was used by Masci et al. (2015). In this case, the filter $\theta(r,\vartheta)$ is anisotropic, but its matching with the function is performed over all possible rotations, which are then aggregated:

\[
(x\star\theta)(u) = \max_{\vartheta_0 \in [0,2\pi)} \int_0^R \int_0^{2\pi} x(u,r,\vartheta)\,\theta(r,\vartheta+\vartheta_0)\,\mathrm{d}r\,\mathrm{d}\vartheta.
\]

Conceptually, this can be visualised as correlating geodesic patches with a rotating filter and collecting the strongest responses.
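A discretised sketch of this idea, under the assumption (not made explicit above) that the patch has been sampled on a regular polar grid, correlates the patch with every cyclic rotation of the filter and keeps the strongest response:

```python
import numpy as np

def angular_max_pool(x_patch, theta_filter):
    """Angular max pooling over a geodesic patch.

    x_patch     : (n_r, n_theta) patch x(u, r, theta) on a polar grid.
    theta_filter: (n_r, n_theta) anisotropic filter theta(r, theta).
    The filter is correlated with the patch at every cyclic rotation of
    the angular axis; taking the maximum removes the dependence on the
    (arbitrary) patch orientation.
    """
    n_theta = x_patch.shape[1]
    responses = [
        np.sum(x_patch * np.roll(theta_filter, shift, axis=1))
        for shift in range(n_theta)
    ]
    return max(responses)
```

Rotating the patch merely permutes the set of responses, so the maximum (and hence the output) is invariant to the gauge.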

On meshes, the continuous integrals can be discretised using a construction referred to as patch operators (Masci et al., 2015). In a geodesic patch around node $u$, the neighbour nodes $\mathcal{N}_u$ (typically, multi-hop neighbours are used), represented in the local polar coordinates as $(r_{uv},\vartheta_{uv})$, are weighted by a set of weighting functions $w_1(r,\vartheta),\ldots,w_K(r,\vartheta)$ (shown in Figure 18 and acting as ‘soft pixels’) and aggregated,

\[
(x\star\theta)_u = \frac{\sum_{k=1}^{K}\sum_{v\in\mathcal{N}_u} w_k(r_{uv},\vartheta_{uv})\,x_v\,\theta_k}{\sum_{k=1}^{K}\sum_{v\in\mathcal{N}_u} w_k(r_{uv},\vartheta_{uv})\,\theta_k}
\]

(here $\theta_1,\ldots,\theta_K$ are the learnable coefficients of the filter). Multi-channel features are treated channel-wise, with a family of appropriate filters. Masci et al. (2015) and Boscaini et al. (2016a) used pre-defined weighting functions $w$, while Monti et al. (2017) further allowed them to be learnable.
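The following sketch illustrates a MoNet-style patch operator; the parametrisation of the weighting functions $w_k$ as axis-aligned Gaussian bumps in $(r,\vartheta)$ is one concrete choice, used here for illustration, and the normalised aggregation follows the form displayed above:

```python
import numpy as np

def patch_operator_conv(x, r_uv, th_uv, mu, sigma, theta):
    """MoNet-style discretised geodesic convolution at a single node u.

    x         : (n,) scalar features of the neighbours v in N_u.
    r_uv,th_uv: (n,) local polar coordinates of the neighbours.
    mu, sigma : (K, 2) centres / widths of K Gaussian weighting
                functions w_k acting as 'soft pixels' (learnable in MoNet).
    theta     : (K,) learnable filter coefficients theta_k.
    """
    coords = np.stack([r_uv, th_uv], axis=1)            # (n, 2)
    diff = coords[None, :, :] - mu[:, None, :]          # (K, n, 2)
    w = np.exp(-0.5 * np.sum((diff / sigma[:, None, :]) ** 2, axis=2))
    num = np.sum(theta[:, None] * w * x[None, :])       # sum_k sum_v w_k x_v theta_k
    den = np.sum(theta[:, None] * w) + 1e-12            # sum_k sum_v w_k theta_k
    return num / den
```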

Figure 18: Left-to-right: examples of patch operators used in Geodesic CNN (Masci et al., 2015), Anisotropic CNN (Boscaini et al., 2016b) and MoNet (Monti et al., 2017), with the level sets of the weighting functions $w_k(r,\vartheta)$ shown in red.

Gauge-equivariant filters

Both isotropic filters and angular max pooling lead to features that are invariant to gauge transformations; they transform according to the trivial representation $\rho(\mathfrak{g})=1$ (where $\mathfrak{g}\in\mathrm{SO}(2)$ is a rotation of the local coordinate frame). This point of view suggests another approach, proposed by Cohen et al. (2019) and de Haan et al. (2020) and discussed in Section 4.5, where the features computed by the network are associated with an arbitrary representation $\rho$ of the structure group $\mathfrak{G}$ (e.g. $\mathrm{SO}(2)$ or $\mathrm{O}(2)$: rotations, or rotations and reflections of the coordinate frame, respectively). Tangent vectors transform according to the standard representation $\rho(\mathfrak{g})=\mathfrak{g}$. As another example, the feature vector obtained by matching $n$ rotated copies of the same filter transforms by cyclic shifts under rotations of the gauge; this is known as the regular representation of the cyclic group $C_n$.
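The last claim can be checked numerically: collecting the responses of $n$ rotated copies of a filter yields a feature vector that undergoes a cyclic shift when the patch itself is rotated. A small sketch, again assuming a regular polar sampling of the patch:

```python
import numpy as np

def rotated_filter_responses(x_patch, theta_filter):
    """Responses of n_theta rotated copies of the same filter.

    Under a rotation of the gauge (a cyclic shift of the angular axis of
    x_patch), this vector of responses itself shifts cyclically: it
    transforms by the regular representation of the cyclic group C_n.
    """
    n_theta = x_patch.shape[1]
    return np.array([
        np.sum(x_patch * np.roll(theta_filter, s, axis=1))
        for s in range(n_theta)
    ])
```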

As discussed in Section 4.5, when dealing with such geometric features (associated with a non-trivial representation), we must first parallel transport them to the same vector space before applying the filter. On a mesh, this can be implemented via the following message passing mechanism described by de Haan et al. (2020). Let $\mathbf{x}_u\in\mathbb{R}^d$ be a $d$-dimensional input feature at mesh node $u$. This feature is expressed relative to an (arbitrary) choice of gauge at $u$, and is assumed to transform according to a representation $\rho_{\textup{in}}$ of $\mathfrak{G}=\mathrm{SO}(2)$ under rotations of the gauge. Similarly, the output features $\mathbf{h}_u$ of the mesh convolution are $d'$-dimensional and should transform according to $\rho_{\textup{out}}$ (which can be chosen at will by the network designer).

By analogy to Graph Neural Networks, we can implement the gauge-equivariant convolution (23) on meshes by sending messages from the neighbours $\mathcal{N}_u$ of $u$ (and from $u$ itself) to $u$:

\[
\mathbf{h}_u = \bm{\Theta}_{\textup{self}}\,\mathbf{x}_u + \sum_{v\in\mathcal{N}_u} \bm{\Theta}_{\textup{neigh}}(\vartheta_{uv})\,\rho(\mathfrak{g}_{v\rightarrow u})\,\mathbf{x}_v, \tag{37}
\]

where $\bm{\Theta}_{\textup{self}}, \bm{\Theta}_{\textup{neigh}}(\vartheta_{uv}) \in \mathbb{R}^{d'\times d}$ are learned filter matrices. The structure group element $\mathfrak{g}_{v\rightarrow u}\in\mathrm{SO}(2)$ denotes the effect of parallel transport from $v$ to $u$, expressed relative to the gauges at $u$ and $v$, and can be precomputed for each mesh. Its action is encoded by a transporter matrix $\rho(\mathfrak{g}_{v\rightarrow u})\in\mathbb{R}^{d\times d}$ (note that $d$ is the feature dimension and is not necessarily equal to 2, the dimension of the mesh). The matrix $\bm{\Theta}_{\textup{neigh}}(\vartheta_{uv})$ depends on the angle $\vartheta_{uv}$ of the neighbour $v$ relative to the reference direction (e.g. the first axis of the frame) at $u$, so this kernel is anisotropic: different neighbours are treated differently.

As explained in Section 4.5, for $\mathbf{h}(u)$ to be a well-defined geometric quantity, it should transform as $\mathbf{h}(u)\mapsto \rho_{\textup{out}}(\mathfrak{g}^{-1}(u))\,\mathbf{h}(u)$ under gauge transformations. This will be the case when $\bm{\Theta}_{\textup{self}}\,\rho_{\textup{in}}(\vartheta) = \rho_{\textup{out}}(\vartheta)\,\bm{\Theta}_{\textup{self}}$ for all $\vartheta\in\mathrm{SO}(2)$ (here we abuse notation, identifying 2D rotations with angles $\vartheta$), and $\bm{\Theta}_{\textup{neigh}}(\vartheta_{uv}-\vartheta)\,\rho_{\textup{in}}(\vartheta) = \rho_{\textup{out}}(\vartheta)\,\bm{\Theta}_{\textup{neigh}}(\vartheta_{uv})$. Since these constraints are linear, the matrices $\bm{\Theta}_{\textup{self}}$ and matrix-valued functions $\bm{\Theta}_{\textup{neigh}}$ satisfying them form a linear subspace, and so we can parameterise them as linear combinations of basis kernels with learnable coefficients: $\bm{\Theta}_{\textup{self}} = \sum_i \alpha_i \bm{\Theta}_{\textup{self}}^i$ and $\bm{\Theta}_{\textup{neigh}} = \sum_i \beta_i \bm{\Theta}_{\textup{neigh}}^i$.
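A schematic implementation of the message passing form of Equation (37) might look as follows. This is a sketch, not the implementation of de Haan et al. (2020): it takes the transporters and angles as given, and does not enforce the kernel constraints, so it illustrates the data flow rather than exact equivariance:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix: rho(g) for the standard SO(2) representation."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])

def gauge_equiv_conv(u, x, nbrs, angles, transport, th_self, th_neigh):
    """Message passing form of Eq. (37):
        h_u = Theta_self x_u + sum_v Theta_neigh(theta_uv) rho(g_{v->u}) x_v

    x        : dict node -> (d,) feature, expressed in that node's gauge.
    nbrs     : list of neighbours of u.
    angles   : dict v -> angle theta_uv of v in u's frame (precomputed).
    transport: dict v -> (d, d) transporter matrix rho(g_{v->u}).
    th_self  : (d', d) matrix; th_neigh: angle -> (d', d) matrix.
    """
    h = th_self @ x[u]
    for v in nbrs:
        # transport the neighbour's feature into u's gauge, then filter
        h = h + th_neigh(angles[v]) @ (transport[v] @ x[v])
    return h
```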

5.7 Recurrent Neural Networks

Our discussion has thus far assumed the inputs to be purely spatial over a given domain. However, in many common use cases, the inputs can also be considered sequential (e.g. video, text or speech). In this case, we assume that the input consists of arbitrarily many steps, wherein at each step $t$ we are provided with an input signal, which we represent as $\mathbf{X}^{(t)}\in\mathcal{X}(\Omega^{(t)})$. (Whether the domain is considered static or dynamic is a matter of time scales: e.g., a road network does change over time, as new roads are built and old ones demolished, but significantly more slowly than the flow of traffic. Similarly, in social networks, changes in engagement (e.g. Twitter users retweeting a tweet) happen at a much higher frequency than changes in the follow graph.)

While in general the domain can evolve in time together with the signals on it, it is typically assumed that the domain is kept fixed for all $t$, i.e. $\Omega^{(t)}=\Omega$. Here, we will exclusively focus on this case, but note that exceptions are common. Social networks are an example where one often has to account for the domain changing through time, as new links are regularly created as well as erased. The domain in this setting is often referred to as a dynamic graph (Xu et al., 2020a; Rossi et al., 2020).

Often, the individual inputs $\mathbf{X}^{(t)}$ will exhibit useful symmetries and hence may be nontrivially treated by any of our previously discussed architectures. Some common examples include: videos ($\Omega$ is a fixed grid, and the signals are a sequence of frames); fMRI scans ($\Omega$ is a fixed mesh representing the geometry of the brain cortex, where different regions are activated at different times in response to presented stimuli); and traffic flow networks ($\Omega$ is a fixed graph representing the road network, on which e.g. the average traffic speed is recorded at various nodes).

Let us assume an encoder function $f(\mathbf{X}^{(t)})$ providing latent representations at the level of granularity appropriate for the problem and respectful of the symmetries of the input domain. As an example (we do not lose generality here; an equivalent analysis can be done e.g. for node-level outputs on a spatiotemporal graph, the only difference being the choice of encoder $f$, which would then be a permutation-equivariant GNN), consider processing video frames: at each timestep, we are given a grid-structured input represented as an $n\times d$ matrix $\mathbf{X}^{(t)}$, where $n$ is the number of pixels (fixed in time) and $d$ is the number of input channels (e.g. $d=3$ for RGB frames). Further, we are interested in analysis at the level of entire frames, in which case it is appropriate to implement $f$ as a translation-invariant CNN, outputting a $k$-dimensional representation $\mathbf{z}^{(t)} = f(\mathbf{X}^{(t)})$ of the frame at time-step $t$.

We are now left with the task of appropriately summarising the sequence of vectors $\mathbf{z}^{(t)}$ across all the steps. A canonical way to dynamically aggregate this information, in a way that respects the temporal progression of inputs and easily allows for online arrival of novel data points, is a Recurrent Neural Network (RNN). (Note that the $\mathbf{z}^{(t)}$ vectors can be seen as points on a temporal grid: hence, processing them with a CNN is also viable in some cases. Transformers are also increasingly popular models for processing generic sequential inputs.) What we will show here is that RNNs are an interesting geometric architecture to study in their own right, since they implement a rather unusual type of symmetry over the inputs $\mathbf{z}^{(t)}$.

SimpleRNNs

Figure 19: Illustration of processing video input with RNNs. Each input video frame $\mathbf{X}^{(t)}$ is processed using a shared function $f$ (e.g. a translation-invariant CNN) into a flat representation $\mathbf{z}^{(t)}$. Then the RNN update function $R$ is iterated across these vectors, updating a summary vector $\mathbf{h}^{(t)}$ which summarises all the inputs up to and including $\mathbf{z}^{(t)}$. The computation is seeded with an initial summary vector $\mathbf{h}^{(0)}$, which may be either pre-determined or learnable.

At each step, the recurrent neural network computes an $m$-dimensional summary vector $\mathbf{h}^{(t)}$ of all the input steps up to and including $t$. This (partial) summary is computed conditionally on the current step's features and the previous step's summary, through a shared update function $R:\mathbb{R}^k\times\mathbb{R}^m\rightarrow\mathbb{R}^m$, as follows (see Figure 19 for a summary):

\[
\mathbf{h}^{(t)} = R(\mathbf{z}^{(t)}, \mathbf{h}^{(t-1)}) \tag{38}
\]

and, as both $\mathbf{z}^{(t)}$ and $\mathbf{h}^{(t-1)}$ are flat vector representations, $R$ may be most easily expressed as a single fully-connected neural network layer, often known as SimpleRNN (in spite of their name, SimpleRNNs are remarkably expressive: Siegelmann and Sontag (1995) showed that such models are Turing-complete, meaning that they can likely represent any computation we may ever be able to execute on computers); see Elman (1990); Jordan (1997):

\[
\mathbf{h}^{(t)} = \sigma\left(\mathbf{W}\mathbf{z}^{(t)} + \mathbf{U}\mathbf{h}^{(t-1)} + \mathbf{b}\right) \tag{39}
\]

where $\mathbf{W}\in\mathbb{R}^{m\times k}$, $\mathbf{U}\in\mathbb{R}^{m\times m}$ and $\mathbf{b}\in\mathbb{R}^m$ are learnable parameters, and $\sigma$ is an activation function. While this introduces loops in the network's computational graph, in practice the network is unrolled for an appropriate number of steps, allowing backpropagation through time (Robinson and Fallside, 1987; Werbos, 1988; Mozer, 1989) to be applied.
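For concreteness, a SimpleRNN unrolled over a sequence can be sketched in a few lines of NumPy (here with $\sigma=\tanh$, a common but not the only choice):

```python
import numpy as np

def simple_rnn(zs, W, U, b, h0):
    """Unrolled SimpleRNN (Eq. 39): h_t = tanh(W z_t + U h_{t-1} + b).

    zs : (T, k) array of step inputs z^(1..T).
    W  : (m, k), U : (m, m), b : (m,) learnable parameters.
    h0 : (m,) initial summary vector h^(0).
    Returns the (T, m) array of summary vectors h^(1..T).
    """
    h = h0
    hs = []
    for z in zs:
        h = np.tanh(W @ z + U @ h + b)  # shared update at every step
        hs.append(h)
    return np.stack(hs)
```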

The summary vectors may then be appropriately leveraged for the downstream task: if a prediction is required at every step of the sequence, a shared predictor may be applied to each $\mathbf{h}^{(t)}$ individually; for classifying entire sequences, typically the final summary $\mathbf{h}^{(T)}$ is passed to a classifier, where $T$ is the length of the sequence.

The initial summary vector is usually either set to the zero vector, i.e. $\mathbf{h}^{(0)}=\mathbf{0}$, or made learnable. Analysing the manner in which the initial summary vector is set also allows us to deduce an interesting form of translation equivariance exhibited by RNNs.

Translation equivariance in RNNs

Since we interpret the individual steps $t$ as discrete time-steps, the input vectors $\mathbf{z}^{(t)}$ can be seen as living on a one-dimensional grid of time-steps. (This construction extends to grids in higher dimensions, allowing us, e.g., to process signals living on images in a scanline fashion; such a construction powered a popular series of models, such as the PixelRNN of van den Oord et al. (2016b).) While it might be attractive to attempt extending our translation equivariance analysis from CNNs here, this cannot be done in a trivial manner.

To see why, let us assume that we have produced a new sequence $\mathbf{z}^{\prime(t)} = \mathbf{z}^{(t+1)}$ by left-shifting our sequence by one step. It might be tempting to attempt showing $\mathbf{h}^{\prime(t)} = \mathbf{h}^{(t+1)}$, as one would expect with translation equivariance; however, this will not hold in general. Consider $t=1$; directly applying and expanding the update function, we recover the following:

\begin{align}
\mathbf{h}^{\prime(1)} &= R(\mathbf{z}^{\prime(1)}, \mathbf{h}^{(0)}) = R(\mathbf{z}^{(2)}, \mathbf{h}^{(0)}) \tag{40}\\
\mathbf{h}^{(2)} &= R(\mathbf{z}^{(2)}, \mathbf{h}^{(1)}) = R(\mathbf{z}^{(2)}, R(\mathbf{z}^{(1)}, \mathbf{h}^{(0)})) \tag{41}
\end{align}

Hence, unless we can guarantee that $\mathbf{h}^{(0)} = R(\mathbf{z}^{(1)}, \mathbf{h}^{(0)})$, we will not recover translation equivariance. A similar analysis can be done for steps $t>1$.
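This failure is easy to confirm numerically. In the sketch below, the update function and its random parameters are arbitrary stand-ins for Equation (39); the summary of the left-shifted sequence does not match the shifted summaries:

```python
import numpy as np

rng = np.random.default_rng(42)
W, U, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=4)

def R(z, h):
    # an arbitrary SimpleRNN-style update, standing in for Eq. (39)
    return np.tanh(W @ z + U @ h + b)

def summaries(zs, h0):
    hs = [h0]
    for z in zs:
        hs.append(R(z, hs[-1]))
    return hs  # hs[t] is h^(t)

zs = rng.normal(size=(5, 3))
h0 = np.zeros(4)
hs = summaries(zs, h0)            # summaries of the original sequence
hs_shift = summaries(zs[1:], h0)  # summaries of the left-shifted sequence

# h'^(1) = R(z^(2), h^(0)) vs. h^(2) = R(z^(2), R(z^(1), h^(0)))
print(np.allclose(hs_shift[1], hs[2]))  # almost surely False
```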

Fortunately, with a slight refactoring of how we represent $\mathbf{z}$, and for a suitable choice of $R$, it is possible to satisfy the equality above, and hence demonstrate a setting in which RNNs are equivariant to shifts. Our problem was largely one of boundary conditions: the equality above involves $\mathbf{z}^{(1)}$, which our left-shift operation destroyed. To abstract this problem away, we will observe how an RNN processes an appropriately left-padded sequence $\bar{\mathbf{z}}^{(t)}$, defined as follows:

\[
\bar{\mathbf{z}}^{(t)} = \begin{cases} \mathbf{0} & t\leq t^{\prime}\\ \mathbf{z}^{(t-t^{\prime})} & t > t^{\prime} \end{cases}
\]

Such a sequence now allows for left-shifting\footnote{Note that equivalent analyses will arise if we use a different padding vector than $\mathbf{0}$.} by up to $t'$ steps without destroying any of the original input elements. Further, note that we do not need to handle right-shifting separately; indeed, equivariance to right shifts naturally follows from the RNN equations.
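As a minimal sketch of this construction (plain Python, with scalar sequence entries and a hypothetical input sequence), padding with $t'=2$ zeros lets two successive left shifts consume only the padding:

```python
# Minimal sketch of the left-padding trick: prepending t' zeros lets us
# left-shift by up to t' steps without destroying any original input element.
# The input sequence below is a hypothetical example.

def left_pad(z, t_prime):
    """Prepend t' zero entries to the input sequence z."""
    return [0] * t_prime + z

def left_shift(z):
    """Shift the sequence one step to the left, dropping its first element."""
    return z[1:]

z = [3, 1, 4, 1, 5]
z_bar = left_pad(z, t_prime=2)            # [0, 0, 3, 1, 4, 1, 5]
shifted = left_shift(left_shift(z_bar))   # two shifts consume only padding

assert shifted == z                       # all original inputs survive
```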

We can now again analyse the operation of the RNN over a left-shifted version of $\bar{\mathbf{z}}^{(t)}$, which we denote by $\bar{\mathbf{z}}'^{(t)} = \bar{\mathbf{z}}^{(t+1)}$, as we did in Equations 40–41:

\begin{align*}
\mathbf{h}'^{(1)} &= R(\bar{\mathbf{z}}'^{(1)}, \mathbf{h}^{(0)}) = R(\bar{\mathbf{z}}^{(2)}, \mathbf{h}^{(0)})\\
\mathbf{h}^{(2)} &= R(\bar{\mathbf{z}}^{(2)}, \mathbf{h}^{(1)}) = R(\bar{\mathbf{z}}^{(2)}, R(\bar{\mathbf{z}}^{(1)}, \mathbf{h}^{(0)})) = R(\bar{\mathbf{z}}^{(2)}, R(\mathbf{0}, \mathbf{h}^{(0)}))
\end{align*}

where the substitution $\bar{\mathbf{z}}^{(1)} = \mathbf{0}$ holds as long as $t' \geq 1$, i.e. as long as any padding is applied\footnote{In a very similar vein, we can derive equivariance to left-shifting by $s$ steps as long as $t' \geq s$.}. Now, we can guarantee equivariance to left-shifting by one step ($\mathbf{h}'^{(t)} = \mathbf{h}^{(t+1)}$) as long as $\mathbf{h}^{(0)} = R(\mathbf{0}, \mathbf{h}^{(0)})$.

Said differently, $\mathbf{h}^{(0)}$ must be chosen to be a fixed point of the function $\gamma(\mathbf{h}) = R(\mathbf{0}, \mathbf{h})$. If the update function $R$ is conveniently chosen, then not only can we guarantee the existence of such fixed points, but we can even directly obtain them by iterating the application of $R$ until convergence; e.g. as follows:

\[\mathbf{h}_0 = \mathbf{0} \qquad \mathbf{h}_{k+1} = \gamma(\mathbf{h}_k), \tag{42}\]

where the index $k$ refers to the iteration of $R$ in our computation, as opposed to the index $(t)$ denoting the time step of the RNN. If we choose $R$ such that $\gamma$ is a contraction mapping\footnote{Contractions are functions $\gamma:\mathcal{X}\rightarrow\mathcal{X}$ such that, under some norm $\|\cdot\|$ on $\mathcal{X}$, applying $\gamma$ contracts the distances between points: for all $\mathbf{x},\mathbf{y}\in\mathcal{X}$, and some $q\in[0,1)$, it holds that $\|\gamma(\mathbf{x})-\gamma(\mathbf{y})\|\leq q\|\mathbf{x}-\mathbf{y}\|$. Iterating such a function then necessarily converges to a unique fixed point, as a direct consequence of Banach's Fixed Point Theorem (Banach, 1922).}, such an iteration will indeed converge to a unique fixed point. Accordingly, we can then iterate Equation (42) until $\mathbf{h}_{k+1} = \mathbf{h}_k$, and set $\mathbf{h}^{(0)} = \mathbf{h}_k$. Note that this computation is equivalent to left-padding the sequence with “sufficiently many” zero-vectors.
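For instance (a scalar sketch with hypothetical weights $w_h = 0.5$ and $b = 1$, not taken from the text), $\gamma(h) = \tanh(w_h h + b)$ is a contraction, and iterating Equation (42) converges to its unique fixed point:

```python
import math

# Scalar sketch (hypothetical weights) of finding h^(0) as the fixed point of
# gamma(h) = R(0, h). With R(z, h) = tanh(w_z*z + w_h*h + b) and |w_h| < 1,
# gamma is a contraction, so iterating Equation (42) converges.

w_h, b = 0.5, 1.0

def gamma(h):
    return math.tanh(w_h * h + b)      # gamma(h) = R(0, h)

h = 0.0                                # h_0 = 0
for _ in range(100):
    h_next = gamma(h)
    if abs(h_next - h) < 1e-12:        # converged: h_{k+1} = h_k
        break
    h = h_next

assert abs(gamma(h) - h) < 1e-10       # h is (numerically) a fixed point
```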

Depth in RNNs

It is also easy to stack multiple RNNs: simply use the $\mathbf{h}^{(t)}$ vectors as an input sequence for a second RNN. This kind of construction is occasionally called a “deep RNN”, which is potentially misleading. Effectively, due to the repeated application of the recurrent operation, even a single RNN “layer” has depth equal to the number of input steps.

This often introduces uniquely challenging learning dynamics when optimising RNNs, as each training example induces many gradient updates to the shared parameters of the update network. Here we will focus on perhaps the most prominent such issue, that of vanishing and exploding gradients (Bengio et al., 1994), which is especially problematic in RNNs given their depth and parameter sharing; indeed, it has single-handedly spurred some of the most influential research on RNNs. For a more detailed overview, we refer the reader to Pascanu et al. (2013), who have studied the training dynamics of RNNs in great detail and exposed these challenges from a variety of perspectives: analytical, geometrical, and that of dynamical systems.

To illustrate vanishing gradients, consider a SimpleRNN with a sigmoidal activation function $\sigma$\footnote{Examples of such an activation include the logistic function, $\sigma(x)=\frac{1}{1+\exp(-x)}$, and the hyperbolic tangent, $\sigma(x)=\tanh x$. They are called sigmoidal due to the distinct S-shape of their plots.}, whose derivative magnitude $|\sigma'|$ is always between $0$ and $1$. Multiplying many such values results in gradients that quickly tend to zero, implying that early steps in the input sequence may not be able to influence the updates to the network parameters at all.
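To make this concrete, a small numerical sketch (with hypothetical scalar pre-activations fixed at $x=1$; in a real network each factor would also involve weight terms) shows how quickly a product of $\tanh$ derivatives decays over 50 backpropagation steps:

```python
import math

# Backprop through T SimpleRNN steps multiplies T sigmoidal derivative factors.
# Scalar sketch: tanh'(x) = 1 - tanh(x)^2 lies in (0, 1], so a product of many
# such factors (here at a fixed, hypothetical pre-activation x = 1) decays fast.

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2     # derivative of tanh, always in (0, 1]

grad = 1.0
for step in range(50):                 # 50 "time steps" of backpropagation
    grad *= tanh_grad(1.0)             # each step scales the gradient down

assert 0.0 < tanh_grad(1.0) < 1.0
assert grad < 1e-8                     # gradient has effectively vanished
```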

For example, consider the next-word prediction task (common in e.g. predictive keyboards), and the input text “Petar is Serbian. He was born on … [long paragraph] … Petar currently lives in \_\_\_\_.” Here, predicting the next word as “Serbia” may only be reasonably concluded by considering the very start of the paragraph, but gradients have likely vanished by the time they reach this input step, making learning from such examples very challenging.

Deep feedforward neural networks have also suffered from the vanishing gradient problem, until the invention of the ReLU activation (which has gradients equal to exactly zero or one—thus fixing the vanishing gradient problem). However, in RNNs, using ReLUs may easily lead to exploding gradients, as the output space of the update function is now unbounded, and gradient descent will update the cell once for every input step, quickly building up the scale of the updates. Historically, the vanishing gradient phenomenon was recognised early on as a significant obstacle in the use of recurrent networks. Coping with this problem motivated the development of more sophisticated RNN layers, which we describe next.

5.8 Long Short-Term Memory networks

Figure 20: The dataflow of the long short-term memory (LSTM), with its components and memory cell ($M$) clearly highlighted. Based on the current input $\mathbf{z}^{(t)}$, previous summary $\mathbf{h}^{(t-1)}$ and previous cell state $\mathbf{c}^{(t-1)}$, the LSTM predicts the updated cell state $\mathbf{c}^{(t)}$ and summary $\mathbf{h}^{(t)}$.

A key invention that significantly reduced the effects of vanishing gradients in RNNs is that of gating mechanisms, which allow the network to selectively overwrite information in a data-driven way. Prominent examples of these gated RNNs include the Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber (1997)) and the Gated Recurrent Unit (GRU; Cho et al. (2014)). Here we will primarily discuss the LSTM—specifically, the variant presented by Graves (2013)—in order to illustrate the operations of such models. Concepts from LSTMs easily carry over to other gated RNNs.

Throughout this section, it will likely be useful to refer to Figure 20, which illustrates all of the LSTM operations that we will discuss in text.

The LSTM augments the recurrent computation by introducing a memory cell, which stores cell state vectors, $\mathbf{c}^{(t)}\in\mathbb{R}^m$, that are preserved between computational steps. The LSTM computes summary vectors, $\mathbf{h}^{(t)}$, directly based on $\mathbf{c}^{(t)}$, and $\mathbf{c}^{(t)}$ is, in turn, computed using $\mathbf{z}^{(t)}$, $\mathbf{h}^{(t-1)}$ and $\mathbf{c}^{(t-1)}$. Critically, the cell is not completely overwritten based on $\mathbf{z}^{(t)}$ and $\mathbf{h}^{(t-1)}$, which would expose the network to the same issues as the SimpleRNN. Instead, a certain quantity of the previous cell state may be retained, and the proportion by which this occurs is explicitly learned from data.

Just like in the SimpleRNN, we compute features by using a single fully-connected neural network layer over the current input step and previous summary:\footnote{Note that we have set the activation function to $\tanh$ here; as LSTMs are designed to ameliorate the vanishing gradient problem, it is now appropriate to use a sigmoidal activation.}

\[\widetilde{\mathbf{c}}^{(t)} = \tanh(\mathbf{W}_c\mathbf{z}^{(t)} + \mathbf{U}_c\mathbf{h}^{(t-1)} + \mathbf{b}_c) \tag{43}\]

But, as mentioned, we do not allow all of this vector to enter the cell, hence why we call it the vector of candidate features and denote it by $\widetilde{\mathbf{c}}^{(t)}$. Instead, the LSTM directly learns gating vectors, real-valued vectors with entries in the range $[0,1]$, which decide how much of the signal should be allowed to enter, exit, and overwrite the memory cell.

Three such gates are computed, all based on $\mathbf{z}^{(t)}$ and $\mathbf{h}^{(t-1)}$: the input gate $\mathbf{i}^{(t)}$, which computes the proportion of the candidate vector allowed to enter the cell; the forget gate $\mathbf{f}^{(t)}$, which computes the proportion of the previous cell state to be retained; and the output gate $\mathbf{o}^{(t)}$, which computes the proportion of the new cell state to be used for the final summary vector. Typically all of these gates are derived using a single fully connected layer, albeit with the logistic sigmoid activation $\mathrm{logistic}(x)=\frac{1}{1+\exp(-x)}$, in order to guarantee that the outputs are in the $[0,1]$ range\footnote{Note that the three gates are themselves vectors, i.e. $\mathbf{i}^{(t)},\mathbf{f}^{(t)},\mathbf{o}^{(t)}\in[0,1]^m$. This allows them to control how much of each of the $m$ dimensions is allowed through the gate.}:

\begin{align}
\mathbf{i}^{(t)} &= \mathrm{logistic}(\mathbf{W}_{\mathrm{i}}\mathbf{z}^{(t)} + \mathbf{U}_{\mathrm{i}}\mathbf{h}^{(t-1)} + \mathbf{b}_{\mathrm{i}}) \tag{44}\\
\mathbf{f}^{(t)} &= \mathrm{logistic}(\mathbf{W}_{\mathrm{f}}\mathbf{z}^{(t)} + \mathbf{U}_{\mathrm{f}}\mathbf{h}^{(t-1)} + \mathbf{b}_{\mathrm{f}}) \tag{45}\\
\mathbf{o}^{(t)} &= \mathrm{logistic}(\mathbf{W}_{\mathrm{o}}\mathbf{z}^{(t)} + \mathbf{U}_{\mathrm{o}}\mathbf{h}^{(t-1)} + \mathbf{b}_{\mathrm{o}}) \tag{46}
\end{align}

Finally, these gates are appropriately applied to decode the new cell state, $\mathbf{c}^{(t)}$, which is then modulated by the output gate to produce the summary vector $\mathbf{h}^{(t)}$, as follows:

\begin{align}
\mathbf{c}^{(t)} &= \mathbf{i}^{(t)} \odot \widetilde{\mathbf{c}}^{(t)} + \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} \tag{47}\\
\mathbf{h}^{(t)} &= \mathbf{o}^{(t)} \odot \tanh(\mathbf{c}^{(t)}) \tag{48}
\end{align}

where $\odot$ denotes element-wise vector multiplication. Applied together, Equations (43)–(48) completely specify the update rule for the LSTM, which now takes into account the cell vector $\mathbf{c}^{(t)}$ as well\footnote{This is still compatible with the RNN update blueprint from Equation (38); simply consider the summary vector to be the concatenation of $\mathbf{h}^{(t)}$ and $\mathbf{c}^{(t)}$, sometimes denoted by $\mathbf{h}^{(t)}\|\mathbf{c}^{(t)}$.}:

\[(\mathbf{h}^{(t)}, \mathbf{c}^{(t)}) = R(\mathbf{z}^{(t)}, (\mathbf{h}^{(t-1)}, \mathbf{c}^{(t-1)}))\]

Note that, as the values of $\mathbf{f}^{(t)}$ are derived from $\mathbf{z}^{(t)}$ and $\mathbf{h}^{(t-1)}$, and are therefore directly learnable from data, the LSTM effectively learns how to appropriately forget past experiences. Indeed, the values of $\mathbf{f}^{(t)}$ directly appear in the backpropagation update for all the LSTM parameters ($\mathbf{W}_*, \mathbf{U}_*, \mathbf{b}_*$), allowing the network to explicitly control, in a data-driven way, the degree of vanishing for the gradients across the time steps.
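The full update, Equations (43)–(48), can be sketched in scalar form ($m=1$) as follows; all weights are arbitrary illustrative values, not trained parameters:

```python
import math

# Scalar (m = 1) sketch of the LSTM update, Equations (43)-(48).
# All weights below are arbitrary illustrative values, not trained parameters.

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(z, h_prev, c_prev, p):
    """One LSTM step: returns the new summary h and cell state c."""
    c_tilde = math.tanh(p["Wc"] * z + p["Uc"] * h_prev + p["bc"])   # Eq. (43)
    i = logistic(p["Wi"] * z + p["Ui"] * h_prev + p["bi"])          # Eq. (44)
    f = logistic(p["Wf"] * z + p["Uf"] * h_prev + p["bf"])          # Eq. (45)
    o = logistic(p["Wo"] * z + p["Uo"] * h_prev + p["bo"])          # Eq. (46)
    c = i * c_tilde + f * c_prev                                    # Eq. (47)
    h = o * math.tanh(c)                                            # Eq. (48)
    return h, c

params = {k: 0.5 for k in
          ["Wc", "Uc", "bc", "Wi", "Ui", "bi", "Wf", "Uf", "bf", "Wo", "Uo", "bo"]}

h, c = 0.0, 0.0
for z in [1.0, -1.0, 0.5]:             # a short hypothetical input sequence
    h, c = lstm_step(z, h, c, params)

assert -1.0 < h < 1.0                  # summary bounded, since h = o * tanh(c)
```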

Besides tackling the vanishing gradient issue head-on, it turns out that gated RNNs also unlock a very useful form of invariance to time-warping transformations, which remains out of reach of SimpleRNNs.

Time warping invariance of gated RNNs

We will start by illustrating, in a continuous-time setting\footnote{We focus on the continuous setting as it is easier to reason about manipulations of time there.}, what it means to warp time, and what is required of a recurrent model in order to achieve invariance to such transformations. Our exposition largely follows the work of Tallec and Ollivier (2018), which initially described this phenomenon; indeed, they were among the first to study RNNs through the lens of invariances.

Let us assume a continuous time-domain signal $z(t)$, on which we would like to apply an RNN. To align the RNN's discrete-time computation of summary vectors $\mathbf{h}^{(t)}$\footnote{We will use $h(t)$ to denote a continuous signal at time $t$, and $\mathbf{h}^{(t)}$ to denote a discrete signal at time-step $t$.} with an analogue in the continuous domain, $h(t)$, we will observe its linear Taylor expansion:

\[h(t+\delta) \approx h(t) + \delta\frac{\mathrm{d}h(t)}{\mathrm{d}t} \tag{49}\]

and, setting $\delta = 1$, we recover a relationship between $h(t)$ and $h(t+1)$, which is exactly what the RNN update function $R$ (Equation 38) computes. Namely, the RNN update function satisfies the following differential equation:

\[\frac{\mathrm{d}h(t)}{\mathrm{d}t} = h(t+1) - h(t) = R(z(t+1), h(t)) - h(t) \tag{50}\]

We would like the RNN to be resilient to the way in which the signal is sampled (e.g. to changes in the time unit of measurement), in order to account for any imperfections or irregularities therein. Formally, we denote a time warping\footnote{Such warping operations can be simple, such as time rescaling, e.g. $\tau(t)=0.7t$, which, in a discrete setting, would amount to new inputs being received every ${\sim}1.43$ steps. However, the definition also admits a wide spectrum of variably-changing sampling rates; e.g. sampling may freely accelerate or decelerate throughout the time domain.} operation $\tau:\mathbb{R}^+\rightarrow\mathbb{R}^+$ as any monotonically increasing differentiable mapping between times. The notation $\tau$ is chosen because time warping represents an automorphism of time.

Further, we say that a class of models is invariant to time warping if, for any model of the class and any such $\tau$, there exists another (possibly the same) model from the class that processes the warped data in the same way as the original model did in the non-warped case.

This is a potentially very useful property. If we have an RNN class capable of modelling short-term dependencies well, and we can also show that this class is invariant to time warping, then we know it is possible to train such a model in a way that will usefully capture long-term dependencies as well (as they would correspond to a time dilation warping of a signal with short-term dependencies). As we will shortly see, it is no coincidence that gated RNN models such as the LSTM were proposed to model long-range dependencies: achieving time warping invariance is tightly coupled with the presence of gating mechanisms, such as the input/forget/output gates of LSTMs.

When time gets warped by $\tau$, the signal observed by the RNN at time $t$ is $z(\tau(t))$ and, to remain invariant to such warpings, it should predict an equivalently-warped summary function $h(\tau(t))$. Using Taylor expansion arguments once more, we derive a form of Equation 50 for the warped time that the RNN update $R$ should satisfy:

\[\frac{\mathrm{d}h(\tau(t))}{\mathrm{d}\tau(t)} = R(z(\tau(t+1)), h(\tau(t))) - h(\tau(t)) \tag{51}\]

However, the above derivative is computed with respect to the warped time $\tau(t)$, and hence does not take into account the original signal. To make our model take the warping transformation into account explicitly, we need to differentiate the warped summary function with respect to $t$. Applying the chain rule yields the following differential equation:

\[\frac{\mathrm{d}h(\tau(t))}{\mathrm{d}t} = \frac{\mathrm{d}h(\tau(t))}{\mathrm{d}\tau(t)}\frac{\mathrm{d}\tau(t)}{\mathrm{d}t} = \frac{\mathrm{d}\tau(t)}{\mathrm{d}t}R(z(\tau(t+1)), h(\tau(t))) - \frac{\mathrm{d}\tau(t)}{\mathrm{d}t}h(\tau(t)) \tag{52}\]

and, for our (continuous-time) RNN to remain invariant to any time warping $\tau(t)$, it needs to be able to explicitly represent the derivative $\frac{\mathrm{d}\tau(t)}{\mathrm{d}t}$, which is not assumed known upfront! We hence need to introduce a learnable function $\Gamma$ which approximates this derivative. For example, $\Gamma$ could be a neural network taking $z(t+1)$ and $h(t)$ as inputs and predicting scalar outputs.

Now, remark that, from the point of view of a discrete RNN model under time warping, its input $\mathbf{z}^{(t)}$ will correspond to $z(\tau(t))$, and its summary $\mathbf{h}^{(t)}$ will correspond to $h(\tau(t))$. To obtain the required relationship of $\mathbf{h}^{(t)}$ to $\mathbf{h}^{(t+1)}$ in order to remain invariant to time warping, we will use a one-step Taylor expansion of $h(\tau(t))$:

\[h(\tau(t+\delta)) \approx h(\tau(t)) + \delta\frac{\mathrm{d}h(\tau(t))}{\mathrm{d}t}\]

and, once again, setting $\delta = 1$, substituting Equation 52, and then discretising:

\begin{align*}
\mathbf{h}^{(t+1)} &= \mathbf{h}^{(t)} + \frac{\mathrm{d}\tau(t)}{\mathrm{d}t}R(\mathbf{z}^{(t+1)}, \mathbf{h}^{(t)}) - \frac{\mathrm{d}\tau(t)}{\mathrm{d}t}\mathbf{h}^{(t)}\\
&= \frac{\mathrm{d}\tau(t)}{\mathrm{d}t}R(\mathbf{z}^{(t+1)}, \mathbf{h}^{(t)}) + \left(1 - \frac{\mathrm{d}\tau(t)}{\mathrm{d}t}\right)\mathbf{h}^{(t)}
\end{align*}

Finally, we swap $\frac{\mathrm{d}\tau(t)}{\mathrm{d}t}$ for the aforementioned learnable function, $\Gamma$. This gives us the required form for our time warping-invariant RNN:

\[\mathbf{h}^{(t+1)} = \Gamma(\mathbf{z}^{(t+1)}, \mathbf{h}^{(t)})\,R(\mathbf{z}^{(t+1)}, \mathbf{h}^{(t)}) + \left(1 - \Gamma(\mathbf{z}^{(t+1)}, \mathbf{h}^{(t)})\right)\mathbf{h}^{(t)} \tag{53}\]

We may quickly deduce that SimpleRNNs (Equation 39) are not time warping invariant, given that they do not feature the second term in Equation 53. Instead, they fully overwrite $\mathbf{h}^{(t)}$ with $R(\mathbf{z}^{(t+1)},\mathbf{h}^{(t)})$, which corresponds to assuming no time warping at all: $\frac{\mathrm{d}\tau(t)}{\mathrm{d}t}=1$, i.e. $\tau(t)=t$.

Further, our link between continuous-time RNNs and the discrete RNN based on $R$ rested on the accuracy of the Taylor approximation, which holds only if the time-warping derivative is not too large, i.e. $\frac{\mathrm{d}\tau(t)}{\mathrm{d}t}\lesssim 1$. The intuitive explanation is this: if our time warping operation ever contracts time in a way that makes time increments ($t\rightarrow t+1$) large enough that intermediate data changes are not sampled, the model can never hope to process time-warped inputs in the same way as original ones; it simply would not have access to the same information. Conversely, time dilations of any form (which, in discrete terms, correspond to interspersing the input time-series with zeroes) are perfectly allowed within our framework.

Combined with our requirement of a monotonically increasing $\tau$ ($\frac{\mathrm{d}\tau(t)}{\mathrm{d}t}>0$), we can bound the output space of $\Gamma$ as $0 < \Gamma(\mathbf{z}^{(t+1)},\mathbf{h}^{(t)}) < 1$, which motivates the use of the logistic sigmoid activation for $\Gamma$, e.g.:

$$\Gamma(\mathbf{z}^{(t+1)},\mathbf{h}^{(t)}) = \mathrm{logistic}(\mathbf{W}_\Gamma\mathbf{z}^{(t+1)} + \mathbf{U}_\Gamma\mathbf{h}^{(t)} + \mathbf{b}_\Gamma)$$

exactly matching the LSTM gating equations (e.g. Equation 44). The main difference is that LSTMs compute gating vectors, whereas Equation 53 implies $\Gamma$ should output a scalar. Vectorised gates (Hochreiter, 1991) allow fitting a different warping derivative in every dimension of $\mathbf{h}^{(t)}$, enabling reasoning over multiple time horizons simultaneously.
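To make Equation 53 concrete, here is a minimal numerical sketch of such a gated recurrent cell, with a $\tanh$ candidate update standing in for $R$ and a logistic gate for $\Gamma$. The class name, weight names, and initialisation scheme are illustrative choices of ours, not prescribed by the text:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedCell:
    """Minimal gated recurrent cell following Equation 53:
    h' = gamma * R(z, h) + (1 - gamma) * h,
    with a logistic gate gamma and a tanh candidate update R."""
    def __init__(self, dim_in, dim_h, rng):
        s = 1.0 / np.sqrt(dim_h)
        self.W  = rng.uniform(-s, s, (dim_h, dim_in))
        self.U  = rng.uniform(-s, s, (dim_h, dim_h))
        self.b  = np.zeros(dim_h)
        self.Wg = rng.uniform(-s, s, (dim_h, dim_in))
        self.Ug = rng.uniform(-s, s, (dim_h, dim_h))
        self.bg = np.zeros(dim_h)

    def step(self, z, h):
        cand = np.tanh(self.W @ z + self.U @ h + self.b)      # R(z, h)
        gate = logistic(self.Wg @ z + self.Ug @ h + self.bg)  # Gamma(z, h)
        return gate * cand + (1.0 - gate) * h                 # Equation 53
```

Since the gate lies in $(0,1)$, each step interpolates between keeping $\mathbf{h}^{(t)}$ and overwriting it with the candidate update, rather than always overwriting as a SimpleRNN would.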

It is worth taking a pause here to summarise what we have done. By requiring that our RNN class is invariant to (non-destructive) time warping, we have derived the necessary form that it must have (Equation 53), and shown that it exactly corresponds to the class of gated RNNs. The gates' primary role under this perspective is to accurately fit the derivative $\frac{\mathrm{d}\tau(t)}{\mathrm{d}t}$ of the warping transformation.

The notion of class invariance is somewhat distinct from the invariances we studied previously. Namely, once we train a gated RNN on a time-warped input with $\tau_1(t)$, we typically cannot zero-shot transfer it to a signal warped by a different $\tau_2(t)$. (One case where zero-shot transfer is possible is when the second time warping is a time rescaling of the first one, $\tau_2(t)=\alpha\tau_1(t)$. Transferring a gated RNN pre-trained on $\tau_1$ to a signal warped by $\tau_2$ merely requires rescaling the gates, $\Gamma_2(\mathbf{z}^{(t+1)},\mathbf{h}^{(t)})=\alpha\Gamma_1(\mathbf{z}^{(t+1)},\mathbf{h}^{(t)})$, while $R$ can retain its parameters, $R_1=R_2$.) Rather, class invariance only guarantees that gated RNNs are powerful enough to fit both of these signals in the same manner, but potentially with vastly different model parameters.
That being said, the realisation that effective gating mechanisms are tightly related to fitting the warping derivative can yield useful prescriptions for gated RNN optimisation, as we now briefly demonstrate.

For example, we can often assume that the dependencies we are interested in tracking within our signal span a range of $[T_l, T_h]$ time-steps.

By analysing the analytic solutions to Equation 52, it can be shown that the characteristic forgetting time of $\mathbf{h}^{(t)}$ in our gated RNN is proportional to $\frac{1}{\Gamma(\mathbf{z}^{(t+1)},\mathbf{h}^{(t)})}$. Hence, we would like our gating values to lie in $\left[\frac{1}{T_h}, \frac{1}{T_l}\right]$ in order to effectively remember information within the assumed range.

Further, if we assume that $\mathbf{z}^{(t)}$ and $\mathbf{h}^{(t)}$ are roughly zero-centered, which is a common by-product of applying transformations such as layer normalisation (Ba et al., 2016), we can assume that $\mathbb{E}[\Gamma(\mathbf{z}^{(t+1)},\mathbf{h}^{(t)})]\approx\mathrm{logistic}(\mathbf{b}_\Gamma)$. Controlling the bias vector of the gating mechanism is hence a very powerful way of controlling the effective gate value. (This insight was already spotted by Gers and Schmidhuber (2000) and Jozefowicz et al. (2015), who empirically recommended initialising the forget-gate bias of LSTMs to a constant positive vector, such as $\mathbf{1}$.)

Combining the two observations, we conclude that an appropriate range of gating values can be obtained by initialising $\mathbf{b}_\Gamma \sim -\log(\mathcal{U}(T_l, T_h) - 1)$, where $\mathcal{U}$ is the uniform real distribution. This recommendation was dubbed chrono initialisation by Tallec and Ollivier (2018), and has been empirically shown to improve the long-range dependency modelling of gated RNNs.
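The chrono initialisation recipe above can be sketched in a few lines. The function name and the per-dimension sampling are our illustrative choices; the formula itself follows the text: drawing $T\sim\mathcal{U}(T_l,T_h)$ and setting $\mathbf{b}_\Gamma=-\log(T-1)$ gives $\mathrm{logistic}(\mathbf{b}_\Gamma)=1/T$, i.e. a gate value whose forgetting time lies in the assumed range:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def chrono_bias(dim, T_lo, T_hi, rng):
    """Chrono initialisation (Tallec & Ollivier, 2018) sketch:
    draw a forgetting time T ~ U(T_lo, T_hi) for each gate dimension
    and set b_Gamma = -log(T - 1), so that logistic(b_Gamma) = 1/T."""
    T = rng.uniform(T_lo, T_hi, size=dim)  # requires T_lo > 1
    return -np.log(T - 1.0)
```

Each dimension thus starts with an effective gate value spread over $\left[\frac{1}{T_h}, \frac{1}{T_l}\right]$, matching the bound derived from the characteristic forgetting time.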

Sequence-to-sequence learning with RNNs

One prominent historical example of RNN-backed computation is sequence-to-sequence translation, such as machine translation of natural languages. The pioneering seq2seq work of Sutskever et al. (2014) achieved this by passing the summary vector $\mathbf{h}^{(T)}$ as an initial input to a decoder RNN, with outputs of RNN blocks being fed as inputs to the next step.

Figure 21: A typical example of a seq2seq architecture with an RNN encoder $R_{\mathrm{enc}}$ and RNN decoder $R_{\mathrm{dec}}$. The decoder is seeded with the final summary vector $\mathbf{h}^{(T)}$ coming out of the encoder, and then proceeds in an autoregressive fashion: at each step, the predicted output from the previous step is fed back as input to $R_{\mathrm{dec}}$. The bottleneck problem is also illustrated with the red lines: the summary vector $\mathbf{h}^{(T)}$ is pressured to store all relevant information for translating the input sequence, which becomes increasingly challenging as the input length grows.

This placed substantial representational pressure on the summary vector, $\mathbf{h}^{(T)}$. Within the context of deep learning, $\mathbf{h}^{(T)}$ is sometimes referred to as a bottleneck. (The bottleneck effect has recently received substantial attention in the graph representation learning community (Alon and Yahav, 2020), as well as in neural algorithmic reasoning (Cappart et al., 2021).) Its fixed capacity must be sufficient for representing the content of the entire input sequence, in a manner that is conducive to generating a corresponding sequence, while also supporting input sequences of substantially different lengths (Figure 21).

In reality, different steps of the output may wish to focus (attend) on different parts of the input, and all such choices are difficult to represent via a bottleneck vector. Following from this observation, the popular recurrent attention model was proposed by Bahdanau et al. (2014). At every step of processing, a query vector is generated by an RNN; this query vector then interacts with the representation of every time-step $\mathbf{h}^{(t)}$, primarily by computing a weighted sum over them. This model pioneered neural content-based attention and predates the success of the Transformer model.
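A rough sketch of such content-based attention, in the additive scoring form popularised by Bahdanau et al. (the parameter names `W_q`, `W_k`, `v` are illustrative, as is the exact scoring function):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def additive_attention(query, keys, W_q, W_k, v):
    """Additive (content-based) attention sketch.
    query: (d,) decoder query vector; keys: (T, d) encoder states h^(t).
    Each score e_t = v . tanh(W_q q + W_k k_t); the output is the
    attention-weighted sum of the encoder states."""
    scores = np.array([v @ np.tanh(W_q @ query + W_k @ k) for k in keys])
    alpha = softmax(scores)       # attention weights over time-steps
    return alpha @ keys, alpha    # context vector and its weights
```

The decoder thus receives a context vector recomputed at every output step, rather than a single fixed bottleneck summary.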

Lastly, while attending offers a soft way to dynamically focus on parts of the input content, substantial work has also explored more explicit ways of directing attention over the input. A powerful, algorithmically grounded example is the pointer network of Vinyals et al. (2015), a simple modification of recurrent attention that allows pointing over elements of variable-sized inputs. These findings were then generalised to the set2set architecture (Vinyals et al., 2016), which extends seq2seq models to unordered sets, supported by pointer network-backed LSTMs.

6 Problems and Applications

Invariances and symmetries arise all too commonly across data originating in the real world. Hence, it should come as no surprise that some of the most popular applications of machine learning in the 21st century have come about as a direct byproduct of Geometric Deep Learning, perhaps sometimes without fully realising this fact. We would like to provide readers with an overview—by no means comprehensive—of influential works in Geometric Deep Learning and exciting and promising new applications. Our motivation is twofold: to demonstrate specific instances of scientific and industrial problems where the five geometric domains commonly arise, and to serve as additional motivation for further study of Geometric Deep Learning principles and architectures.

Chemistry and Drug Design

One of the most promising applications of representation learning on graphs is in computational chemistry and drug development. (Many drugs are not designed but discovered, often serendipitously. The historic plant origins of a number of drugs are reflected in their names: e.g., acetylsalicylic acid, commonly known as aspirin, is contained in the bark of the willow tree (Salix alba), whose medicinal properties have been known since antiquity.) Traditional drugs are small molecules designed to chemically attach (‘bind’) to some target molecule, typically a protein, in order to activate or disrupt some disease-related chemical process. Unfortunately, drug development is an extremely long and expensive process: at the time of writing, bringing a new drug to the market typically takes more than a decade and costs more than a billion dollars. One of the reasons is the cost of testing, where many drugs fail at different stages: less than 5% of candidates make it to the last stage (see e.g. Gaudelet et al. (2020)).

Since the space of chemically synthesisable molecules is very large (estimated at around $10^{60}$), the search for candidate molecules with the right combination of properties, such as target binding affinity, low toxicity, solubility, etc., cannot be done experimentally, and virtual or in silico screening (i.e., the use of computational techniques to identify promising molecules) is employed. Machine learning techniques play an increasingly prominent role in this task. A prominent example of the use of Geometric Deep Learning for virtual drug screening was recently shown by Stokes et al. (2020): using a graph neural network trained to predict whether or not candidate molecules inhibit growth in the model bacterium Escherichia coli, they were able to discover that Halicin, a molecule originally indicated for treating diabetes, is a highly potent antibiotic, even against bacterial strains with known antibiotic resistance. This discovery was widely covered in both the scientific and popular press.

Speaking more broadly, the application of graph neural networks to molecules modeled as graphs has been a very active field, with multiple specialised architectures proposed recently that are inspired by physics and incorporate, e.g., equivariance to rotations and translations (see e.g. Thomas et al. (2018); Anderson et al. (2019); Fuchs et al. (2020); Satorras et al. (2021)). Further, Bapst et al. (2020) have successfully demonstrated the utility of GNNs for predictively modelling the dynamics of glass, in a manner that outperformed the previously available physics-based models. Historically, many works in computational chemistry were precursors of modern graph neural network architectures, sharing many common traits with them.

Drug Repositioning

While generating entirely novel drug candidates is a potentially viable approach, a faster and cheaper avenue for developing new therapies is drug repositioning, which seeks to evaluate already-approved drugs (either alone or in combinations) for a novel purpose. This often significantly decreases the amount of clinical evaluation necessary to release the drug to the market. At some level of abstraction, the action of drugs on the body's biochemistry, and their interactions with each other and with other biomolecules, can be modeled as a graph. This gives rise to the concept of ‘network medicine’, coined by the prominent network scientist Albert-László Barabási, which advocates the use of biological networks (such as protein-protein interactions and metabolic pathways) to develop new therapies (Barabási et al., 2011).

Geometric Deep Learning offers a modern take on this class of approaches. A prominent early example is the work of Zitnik et al. (2018), who used graph neural networks to predict side effects in a form of drug repositioning known as combinatorial therapy or polypharmacy, formulated as edge prediction in a drug-drug interaction graph. The novel coronavirus pandemic, which is largely ongoing at the time of writing this text, has sparked a particular interest in attempting to apply such approaches against COVID-19 (Gysi et al., 2020). Finally, we should note that drug repositioning is not necessarily limited to synthetic molecules: Veselkov et al. (2019) applied similar approaches to drug-like molecules contained in food (since, as we mentioned, many plant-based foods contain biological analogues of compounds used in oncological therapy). One of the authors of this text is involved in a collaboration adding a creative twist to this research, by partnering with a molecular chef who designs exciting recipes based on the ‘hyperfood’ ingredients rich in such drug-like molecules.

Protein biology

Since we have already mentioned proteins as drug targets, let us spend a few more moments on this topic. Proteins are arguably among the most important biomolecules, with myriad functions in our body, including protection against pathogens (antibodies), giving structure to our skin (collagen), transporting oxygen to cells (haemoglobin), catalysing chemical reactions (enzymes), and signaling (many hormones are proteins). Chemically speaking, a protein is a biopolymer: a chain of small building blocks called amino acids that, under the influence of electrostatic forces, folds into a complex 3D structure. It is this structure that endows the protein with its functions (a common metaphor, dating back to the chemistry Nobel laureate Emil Fischer in 1894, is the Schlüssel-Schloss-Prinzip or ‘key-lock principle’: two proteins often only interact if they have geometrically and chemically complementary structures), and hence it is crucial to understanding how proteins work and what they do. Since proteins are common targets for drug therapies, the pharmaceutical industry has a keen interest in this field.

A typical hierarchy of problems in protein bioinformatics goes from protein sequence (a 1D string over an alphabet of 20 different amino acids) to 3D structure (a problem known as ‘protein folding’) to function (‘protein function prediction’). Recent approaches such as DeepMind's AlphaFold by Senior et al. (2020) used contact graphs to represent the protein structure. Gligorijevic et al. (2020) showed that applying graph neural networks to such graphs achieves better function prediction than purely sequence-based methods.

Gainza et al. (2020) developed a Geometric Deep Learning pipeline called MaSIF for predicting interactions between proteins from their 3D structure. MaSIF models the protein as a molecular surface discretised as a mesh, arguing that this representation is advantageous when dealing with interactions, as it allows abstracting away the internal fold structure. The architecture is based on a mesh convolutional neural network operating on pre-computed chemical and geometric features in small local geodesic patches. The network was trained using a few thousand co-crystal protein 3D structures from the Protein Data Bank to address multiple tasks, including interface prediction, ligand classification, and docking, and enabled de novo (‘from scratch’) design of proteins that could in principle act as biological immunotherapy drugs against cancer: such proteins are designed to inhibit protein-protein interactions (PPI) between parts of the programmed cell death protein complex (PD-1/PD-L1), giving the immune system the ability to attack the tumor cells. (Margin figure: the oncological target PD-L1 protein surface, with a heat map indicating the predicted binding site, and the designed binder shown as a ribbon diagram.)

Recommender Systems and Social Networks

The first popularised large-scale applications of graph representation learning occurred within social networks, primarily in the context of recommender systems. Recommenders are tasked with deciding which content to serve to users, potentially depending on their previous history of interactions on the service. This is typically realised through a link prediction objective: supervise the embeddings of various nodes (pieces of content) such that they are kept close together if they are deemed related (e.g. commonly viewed together). The proximity of two embeddings (e.g. their inner product) can then be interpreted as the probability that they are linked by an edge in the content graph; hence, for any content queried by users, one approach is to serve its $k$ nearest neighbours in the embedding space.
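The serving step described above can be sketched in a few lines, assuming the content embeddings have already been trained with a link-prediction objective (the function name and scoring choices are illustrative):

```python
import numpy as np

def recommend(query_idx, embeddings, k):
    """Serve the k nearest neighbours of a content item in embedding
    space, scoring candidate pairs by inner product (interpretable,
    after a logistic, as the probability of an edge in the content
    graph). A sketch of the serving step only; training the
    embeddings is a separate link-prediction problem."""
    scores = embeddings @ embeddings[query_idx]  # inner products
    scores[query_idx] = -np.inf                  # exclude the query itself
    return np.argsort(-scores)[:k]               # top-k candidate indices
```

In production systems such as PinSage, the expensive part is producing and maintaining the embeddings at scale; nearest-neighbour retrieval is then typically done approximately rather than by the exact sort above.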

Among the pioneers of this methodology is the American image sharing and social media company Pinterest: besides presenting one of the first successful deployments of GNNs in production, their method, PinSage, successfully scaled graph representation learning to graphs of millions of nodes and billions of edges (Ying et al., 2018). (Pinterest also presented follow-up work, PinnerSage (Pal et al., 2020), which effectively integrates user-specific contextual information into the recommender.) Related applications, particularly in the space of product recommendations, soon followed. Popular GNN-backed recommenders currently deployed in production include Alibaba's Aligraph (Zhu et al., 2019) and Amazon's P-Companion (Hao et al., 2020). In this way, graph deep learning influences millions of people on a daily basis.

Within the context of content analysis on social networks, another noteworthy effort is Fabula AI, which is among the first GNN-based startups to be acquired (in 2019, by Twitter). The startup, founded by one of the authors of the text and his team, developed novel technology for detecting misinformation on social networks (Monti et al., 2019). Fabula’s solution consists of modelling the spread of a particular news item by the network of users who shared it. The users are connected if one of them re-shared the information from the other, but also if they follow each other on the social network. This graph is then fed into a graph neural network, which classifies the entire graph as either ‘true’ or ‘fake’ content – with labels based on agreement between fact-checking bodies. Besides demonstrating strong predictive power which stabilises quickly (often within a few hours of the news spreading), analysing the embeddings of individual user nodes revealed clear clustering of users who tend to share incorrect information, exemplifying the well-known ‘echo chamber’ effect.

Traffic forecasting

Transportation networks are another area where Geometric Deep Learning techniques are already making an actionable impact on billions of users worldwide. For example, on road networks, we can treat intersections as nodes, and road segments as edges connecting them; these edges can then be featurised by the road length, current or historical speeds along the segment, and the like. (Margin figure: a road network with its corresponding graph representation.)

One standard prediction problem in this space is predicting the estimated time of arrival (ETA): for a given candidate route, providing the expected travel time necessary to traverse it. Such a problem is essential in this space, not only for user-facing traffic recommendation apps, but also for enterprises (such as food delivery or ride-sharing services) that leverage these predictions within their own operations.

Graph neural networks have shown immense promise in this space as well: they can, for example, be used to directly predict the ETA for a relevant subgraph of the road network (effectively, a graph regression task). Such an approach was successfully leveraged by DeepMind, yielding a GNN-based ETA predictor now deployed in production at Google Maps (Derrow-Pinion et al., 2021), serving ETA queries in several major metropolitan areas worldwide, with relative improvements in prediction quality of over 40% in cities like Sydney. Similar returns have been observed by the Baidu Maps team, where travel time predictions are currently served by the ConSTGAT model, which is based on a spatio-temporal variant of the graph attention network model (Fang et al., 2020).

Object recognition

A principal benchmark for machine learning techniques in computer vision is the ability to classify a central object within a provided image (e.g. a photograph of a tabby cat, one of the ImageNet classes). The ImageNet large scale visual recognition challenge (Russakovsky et al., 2015, ILSVRC) was an annual object classification challenge that propelled much of the early development in Geometric Deep Learning. ImageNet requires models to classify realistic images scraped from the Web into one of 1000 categories: such categories are at the same time diverse (covering both animate and inanimate objects) and specific (with many classes focused on distinguishing various cat and dog breeds). Hence, good performance on ImageNet often implies a solid level of feature extraction from general photographs, which formed a foundation for various transfer learning setups from pre-trained ImageNet models.

The success of convolutional neural networks on ImageNet, particularly the AlexNet model of Krizhevsky et al. (2012), which swept ILSVRC 2012 by a large margin, has in large part spearheaded the adoption of deep learning as a whole, both in academia and in industry. Since then, CNNs have consistently ranked on top of the ILSVRC, spawning many popular architectures such as VGG-16 (Simonyan and Zisserman, 2014), Inception (Szegedy et al., 2015) and ResNets (He et al., 2016), which have successfully surpassed human-level performance on this task. (Interestingly, the VGG-16 architecture has sixteen weight layers and was denoted as “very deep” by its authors; subsequent developments quickly scaled such models up to hundreds or even thousands of layers.) The design decisions and regularisation techniques employed by these architectures (such as rectified linear activations (Nair and Hinton, 2010), dropout (Srivastava et al., 2014), skip connections (He et al., 2016) and batch normalisation (Ioffe and Szegedy, 2015)) form the backbone of many of the effective CNN models in use today.

Concurrently with object classification, significant progress has been made on object detection: isolating all objects of interest within an image and tagging them with appropriate classes. Such a task is relevant in a variety of downstream problems, from image captioning all the way to autonomous vehicles. It necessitates a more fine-grained approach, as the predictions need to be localised; as such, translation equivariant models have often proven their worth in this domain. One impactful example in this space is the R-CNN family of models (Girshick et al., 2014; Girshick, 2015; Ren et al., 2015; He et al., 2017), while, in the related field of semantic segmentation, the SegNet model of Badrinarayanan et al. (2017) proved influential, with its encoder-decoder architecture relying on the VGG-16 backbone.

Game playing

Convolutional neural networks also play a prominent role as translation-invariant feature extractors in reinforcement learning (RL) environments, whenever the observed state can be represented in a grid domain; e.g. this is the case when learning to play video games from pixels. In this case, the CNN is responsible for reducing the input to a flat vector representation, which is then used for deriving policy or value functions that drive the RL agent’s behaviour. While the specifics of reinforcement learning are not the focus of this section, we do note that some of the most impactful results of deep learning in the past decade have come about through CNN-backed reinforcement learning.

One particular example certainly worth mentioning here is DeepMind's AlphaGo (Silver et al., 2016). It encodes the current state within a game of Go by applying a CNN to the $19\times 19$ grid representing the current positions of the placed stones. Then, through a combination of learning from previous expert moves, Monte Carlo tree search, and self-play, it successfully reached a level of Go mastery sufficient to outperform Lee Sedol, one of the strongest Go players of all time, in a widely publicised five-round challenge match.

While this already represented a significant milestone for broader artificial intelligence (Go has a substantially more complex state-space than, say, chess: the game is played on a $19\times 19$ board, with two players placing white and black stones on empty fields, and the number of legal states has been estimated at $\approx 2\times 10^{170}$ (Tromp and Farnebäck, 2006), vastly outnumbering the number of atoms in the universe), the development of AlphaGo did not stop there. The authors gradually removed more and more Go-specific biases from the architecture: AlphaGo Zero removed human biases, optimising purely through self-play (Silver et al., 2017); AlphaZero expanded this algorithm to related two-player games, such as Chess and Shogi; lastly, MuZero (Schrittwieser et al., 2020) incorporated a model that enables learning the rules of the game on-the-fly, allowing it to reach strong performance on the Atari 2600 console, as well as Go, Chess and Shogi, without any upfront knowledge of the rules. Throughout all of these developments, CNNs remained the backbone of these models' representation of the input.

While several high-performing RL agents were proposed for the Atari 2600 platform over the years (Mnih et al., 2015, 2016; Schulman et al., 2017), for a long time they were unable to reach human-level performance on all of the 57 games provided therein. This barrier was finally broken with Agent57 (Badia et al., 2020), which used a parametric family of policies, ranging from strongly exploratory to purely exploitative, and prioritising them in different ways during different stages of training. It, too, powers most of its computations by a CNN applied to the video game’s framebuffer.

Text and speech synthesis

Besides images (which naturally map to a two-dimensional grid), several of (geometric) deep learning’s strongest successes have happened on one-dimensional grids. Natural examples of this are text and speech, bringing the Geometric Deep Learning blueprint into diverse areas such as natural language processing and digital signal processing.

Some of the most widely applied and publicised works in this space focus on synthesis: being able to generate speech or text, either unconditionally or conditioned on a particular prompt. Such a setup can support a plethora of useful tasks, such as text-to-speech (TTS), predictive text completion, and machine translation. Various neural architectures for text and speech generation have been proposed over the past decade, initially mostly based on recurrent neural networks (e.g. the aforementioned seq2seq model (Sutskever et al., 2014) or recurrent attention (Bahdanau et al., 2014)). However, in recent times, they have been gradually replaced by convolutional neural networks and Transformer-based architectures.

One particular limitation of simple 1D convolutions in this setting is their linearly growing receptive field, requiring many layers in order to cover the sequence generated so far. Dilated convolutions (also referred to as à trous convolutions, literally “holed” in French), instead, offer an exponentially growing receptive field with an equivalent number of parameters. Owing to this, they proved a very strong alternative, eventually becoming competitive with RNNs on machine translation (Kalchbrenner et al., 2016), while drastically reducing the computational complexity, owing to their parallelisability across all input positions; such techniques have also outperformed RNNs on problems as diverse as protein-protein interaction (Deac et al., 2019). The most well-known application of dilated convolutions is the WaveNet model of van den Oord et al. (2016a). WaveNets demonstrated that, using dilations, it is possible to synthesise speech at the level of the raw waveform (typically 16,000 samples per second or more), producing speech samples that were significantly more “human-like” than those of the best previous text-to-speech (TTS) systems; besides this, the WaveNet model proved capable of generating piano pieces. Subsequently, it was demonstrated that the computations of WaveNets can be distilled into a much simpler model, the WaveRNN (Kalchbrenner et al., 2018), which enabled deploying this technology at an industrial scale. This allowed not only large-scale speech generation for services such as the Google Assistant, but also efficient on-device computation; e.g. for Google Duo, which uses end-to-end encryption.
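The receptive-field argument above can be checked with a short calculation. The sketch below is a minimal illustration (the helper name `receptive_field` is ours, not from any library), comparing a stack of ordinary convolutions against a WaveNet-style stack with doubling dilations:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of 1D convolutions.

    Each layer with kernel size k and dilation d extends the
    receptive field by (k - 1) * d positions.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Eight ordinary layers (dilation 1): linear growth in depth.
standard = receptive_field(2, [1] * 8)                    # -> 9
# Eight layers with doubling dilations 1, 2, 4, ..., 128:
# exponential growth, with the same number of parameters.
dilated = receptive_field(2, [2 ** i for i in range(8)])  # -> 256
```

With kernel size 2, covering one second of 16 kHz audio would take roughly 16,000 ordinary layers, but only 14 layers of doubling dilations ($2^{14} = 16{,}384$).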

Transformers (Vaswani et al., 2017) have managed to surpass the limitations of both recurrent and convolutional architectures, showing that self-attention is sufficient for achieving state-of-the-art performance in machine translation. Subsequently, they have revolutionised natural language processing. Through the pre-trained embeddings provided by models such as BERT (Devlin et al., 2018), Transformers have powered a large number of downstream applications of natural language processing—for example, Google uses BERT embeddings in its search engine.

Arguably the most widely publicised application of Transformers in the past years is text generation, spurred primarily by the Generative Pre-trained Transformer (GPT, Radford et al. (2018, 2019); Brown et al. (2020)) family of models from OpenAI. In particular, GPT-3 (Brown et al., 2020) successfully scaled language model learning to 175 billion learnable parameters, trained on next-word prediction over web-scale amounts of scraped textual corpora. This allowed it to become not only a highly potent few-shot learner on a variety of language-based tasks, but also a text generator capable of producing coherent and human-sounding pieces of text. This capability not only opened up a large number of downstream applications, but also attracted vast media coverage.

Healthcare

Applications in the medical domain are another promising area for Geometric Deep Learning. There are multiple ways in which these methods are being used. First, more traditional architectures such as CNNs have been applied to grid-structured data, for example, for the prediction of length of stay in Intensive Care Units (Rocheteau et al., 2020), or diagnosis of sight-threatening diseases from retinal scans (De Fauw et al., 2018). Winkels and Cohen (2019) showed that using 3D roto-translation group convolutional networks improves the accuracy of pulmonary nodule detection compared to conventional CNNs.

Second, modelling organs as geometric surfaces, mesh convolutional neural networks were shown to be able to address a diverse range of tasks, from reconstructing facial structure from genetics-related information (Mahdi et al., 2020) to brain cortex parcellation (Cucurull et al., 2018) to regressing demographic properties from cortical surface structures (Besson et al., 2020). The latter examples represent an increasing trend in neuroscience to consider the brain as a surface with complex folds (such structures of the brain cortex are called sulci and gyri in the anatomical literature), giving rise to highly non-Euclidean structures.

At the same time, neuroscientists often try to construct and analyse functional networks of the brain, representing the various regions of the brain that are activated together when performing some cognitive function; these networks are often constructed using functional magnetic resonance imaging (fMRI), which shows in real time which areas of the brain consume more blood (typically, Blood Oxygen-Level Dependent (BOLD) contrast imaging is used). These functional networks can reveal patient demographics (e.g., telling apart males from females, Arslan et al. (2018)), and can also be used for neuropathology diagnosis, which is the third area of application of Geometric Deep Learning in medicine we would like to highlight here. In this context, Ktena et al. (2017) pioneered the use of graph neural networks for the prediction of neurological conditions such as Autism Spectrum Disorder. The geometric and functional structure of the brain appears to be intimately related, and recently Itani and Thanou (2021) pointed to the benefits of exploiting them jointly in neurological disease analysis.

Fourth, patient networks are becoming more prominent in ML-based medical diagnosis. The rationale behind these methods is that information about patients’ demographic, genotypic, and phenotypic similarity could improve the prediction of their disease. Parisot et al. (2018) applied graph neural networks to networks of patients created from demographic features for neurological disease diagnosis, showing that the use of the graph improves prediction results. Cosmo et al. (2020) showed the benefits of latent graph learning (by which the network learns an unknown patient graph) in this setting. The latter work used data from the UK Biobank, a large-scale collection of medical data including brain imaging (Miller et al., 2016).

A wealth of data about hospital patients may be found in electronic health records (EHRs); publicly available anonymised critical-care EHR datasets include MIMIC-III (Johnson et al., 2016) and eICU (Pollard et al., 2018). Besides giving a comprehensive view of the patient’s progression, EHR analysis allows for relating similar patients to one another. This aligns with the pattern recognition method commonly used in diagnostics, in which the clinician recognises a pattern of clinical characteristics from experience; it may be the primary method when the clinician’s experience enables them to diagnose the condition quickly. Along these lines, several works attempt to construct a patient graph based on EHR data, either by analysing the embeddings of their doctor’s notes (Malone et al., 2018), diagnosis similarity on admission (Rocheteau et al., 2021), or even assuming a fully-connected graph (Zhu and Razavian, 2019). In all cases, promising results have been shown in favour of using graph representation learning for processing EHRs.

Particle physics and astrophysics

High energy physicists were perhaps among the first domain experts in the natural sciences to embrace the new shiny tool of graph neural networks. In a recent review paper, Shlomi et al. (2020) note that machine learning has historically been heavily used in particle physics experiments, either to learn complicated inverse functions allowing one to infer the underlying physics process from the information measured in the detector, or to perform classification and regression tasks. For the latter, it was often necessary to force the data into an unnatural representation such as a grid, in order to be able to use standard deep learning architectures such as CNNs. Yet, many problems in physics involve data in the form of unordered sets with rich relations and interactions, which can be naturally represented as graphs.

One important application in high-energy physics is the reconstruction and classification of particle jets – sprays of stable particles arising from multiple successive interactions and decays of particles originating from a single initial event. In the Large Hadron Collider, the largest and best-known particle accelerator, built at CERN, such jets are the result of collisions of protons at nearly the speed of light. These collisions produce massive particles, such as the long sought-for Higgs boson or the top quark. The identification and classification of collision events is of crucial importance, as it might provide experimental evidence for the existence of new particles.

Multiple Geometric Deep Learning approaches have recently been proposed for the particle jet classification task, e.g. by Komiske et al. (2019) and Qu and Gouskos (2019), based on the DeepSet and Dynamic Graph CNN architectures, respectively. More recently, there has also been interest in developing specialised architectures derived from physics considerations and incorporating inductive biases consistent with Hamiltonian or Lagrangian mechanics (see e.g. Sanchez-Gonzalez et al. (2019); Cranmer et al. (2020)), equivariant to the Lorentz group (a fundamental symmetry of space and time in physics) (Bogatskiy et al., 2020), or even incorporating symbolic reasoning (Cranmer et al., 2019) and capable of learning physical laws from data. Such approaches are more interpretable (and thus considered more ‘trustworthy’ by domain experts) and also offer better generalisation.

Besides particle accelerators, particle detectors are now being used by astrophysicists for multi-messenger astronomy – a new way of coordinated observation of disparate signals, such as electromagnetic radiation, gravitational waves, and neutrinos, coming from the same source. Neutrino astronomy is of particular interest, since neutrinos interact only very rarely with matter, and thus travel enormous distances practically unaffected. Detecting neutrinos makes it possible to observe objects inaccessible to optical telescopes, but requires enormously-sized detectors – the IceCube neutrino observatory uses a cubic kilometre of Antarctic ice at the South Pole as its detector. Detecting high-energy neutrinos may shed light on some of the most mysterious objects in the Universe, such as blazars and black holes. Choma et al. (2018) used a geometric neural network to model the irregular geometry of the IceCube neutrino detector, showing significantly better performance in detecting neutrinos coming from astrophysical sources (high-energy single muons) and separating them from background events (muon bundles), which leave characteristically different patterns of light deposition in the detector.

While neutrino astronomy holds great promise for the study of the Cosmos, traditional optical and radio telescopes are still the workhorses of astronomy. With these traditional instruments, Geometric Deep Learning can still offer new methodologies for data analysis. For example, Scaife and Porter (2021) used rotationally-equivariant CNNs for the classification of radio galaxies, and McEwen et al. (2021) used spherical CNNs for the analysis of the cosmic microwave background radiation, a relic from the Big Bang that might shed light on the formation of the primordial Universe. As we already mentioned, such signals are naturally represented on the sphere, and equivariant neural networks are an appropriate tool to study them.

Virtual and Augmented Reality

Another field of applications which served as the motivation for the development of a large class of Geometric Deep Learning methods is computer vision and graphics, in particular, dealing with 3D body models for virtual and augmented reality. Motion capture technology used to produce special effects in movies like Avatar often operates in two stages: first, the input from a 3D scanner capturing the motions of the body or the face of the actor is put into correspondence with some canonical shape, typically modelled as a discrete manifold or a mesh (this problem is often called ‘analysis’). Second, a new shape is generated to repeat the motion of the input (‘synthesis’). Initial works on Geometric Deep Learning in computer graphics and vision (Masci et al., 2015; Boscaini et al., 2016a; Monti et al., 2017) developed mesh convolutional neural networks to address the analysis problem, or more specifically, deformable shape correspondence.

First geometric autoencoder architectures for 3D shape synthesis were proposed independently by Litany et al. (2018) and Ranjan et al. (2018). In these architectures, a canonical mesh (of the body, face, or hand) was assumed to be known, and the synthesis task consisted of regressing the 3D coordinates of the nodes (the embedding of the surface, in the jargon of differential geometry). Kulon et al. (2020) showed a hybrid pipeline for 3D hand pose estimation with an image CNN-based encoder and a geometric decoder, capable of reconstructing complex 3D hand poses from 2D images in the wild. A demo of this system, developed in collaboration with the British startup company Ariel AI and presented at CVPR 2020, made it possible to create realistic body avatars with fully articulated hands from video input on a mobile phone faster than real-time. Ariel AI was acquired by Snap in 2020, and at the time of writing its technology is used in Snap’s augmented reality products.

7 Historic Perspective

“Symmetry, as wide or as narrow as you may define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty, and perfection.” This somewhat poetic definition of symmetry is given in the eponymous book of the great mathematician Hermann Weyl (2015), his Schwanengesang on the eve of retirement from the Institute for Advanced Study in Princeton. Weyl traces the special place symmetry has occupied in science and art to ancient times, from Sumerian symmetric designs to the Pythagoreans, who believed the circle to be perfect due to its rotational symmetry. Plato considered the five regular polyhedra bearing his name today (the tetrahedron, cube, octahedron, dodecahedron, and icosahedron) so fundamental that they must be the basic building blocks shaping the material world. Yet, though Plato is credited with coining the term συμμετρία, which literally translates as ‘same measure’, he used it only vaguely to convey the beauty of proportion in art and harmony in music. It was the astronomer and mathematician Johannes Kepler who attempted the first rigorous analysis of the symmetric shape of water crystals. In his treatise ‘On the Six-Cornered Snowflake’ (fully titled Strena, Seu De Nive Sexangula, ‘New Year’s gift, or on the Six-Cornered Snowflake’, a small booklet sent by Kepler in 1611 as a Christmas gift to his patron and friend Johannes Matthäus Wackher von Wackenfels), he attributed the six-fold dihedral structure of snowflakes to the hexagonal packing of particles – an idea that, though it preceded the clear understanding of how matter is formed, still holds today as the basis of crystallography (Ball, 2011).

Symmetry in Mathematics and Physics

In modern mathematics, symmetry is almost univocally expressed in the language of group theory. The origins of this theory are usually attributed to Évariste Galois, who coined the term and used it to study the solvability of polynomial equations in the 1830s. Two other names associated with group theory are those of Sophus Lie and Felix Klein, who met and worked fruitfully together for a period of time (Tobies, 2019). The former would develop the theory of continuous symmetries that today bears his name; the latter proclaimed group theory to be the organising principle of geometry in his Erlangen Programme, which we mentioned in the beginning of this text. Riemannian geometry was explicitly excluded from Klein’s unified geometric picture, and it took another fifty years before it was integrated, largely thanks to the work of Élie Cartan in the 1920s.

Emmy Noether, Klein’s colleague in Göttingen, proved that every differentiable symmetry of the action of a physical system has a corresponding conservation law (Noether, 1918). In physics, this was a stunning result: beforehand, meticulous experimental observation was required to discover fundamental laws such as the conservation of energy, and even then, they remained empirical facts with no apparent origin. Noether’s Theorem — “a guiding star to 20th and 21st century physics”, in the words of the Nobel laureate Frank Wilczek — showed that the conservation of energy emerges from the translational symmetry of time: the rather intuitive idea that the results of an experiment should not depend on whether it is conducted today or tomorrow.

The symmetry associated with charge conservation is the global gauge invariance of the electromagnetic field, first appearing in Maxwell’s formulation of electrodynamics (Maxwell, 1865); however, its importance initially remained unnoticed. The same Hermann Weyl who wrote so dithyrambically about symmetry was the first to introduce the concept of gauge invariance in physics in the early 20th century, emphasising its role as a principle from which electromagnetism can be derived. (Weyl first conjectured, incorrectly, in 1919 that invariance under the change of scale or “gauge” was a local symmetry of electromagnetism; the term gauge, or Eich in German, was chosen by analogy to the various track gauges of railroads. After the development of quantum mechanics, Weyl (1929) modified the gauge choice by replacing the scale factor with a change of wave phase. See Straumann (1996).) It took several decades until this fundamental principle — in its generalised form developed by Yang and Mills (1954) — proved successful in providing a unified framework to describe the quantum-mechanical behaviour of electromagnetism and the weak and strong forces, finally culminating in the Standard Model that captures all the fundamental forces of nature but gravity. We can thus join another Nobel-winning physicist, Philip Anderson (1972), in concluding that “it is only slightly overstating the case to say that physics is the study of symmetry.”

Early Use of Symmetry in Machine Learning

In machine learning and its applications to pattern recognition and computer vision, the importance of symmetry has long been recognised. Early work on designing equivariant feature detectors for pattern recognition was done by Amari (1978), Kanatani (2012), and Lenz (1990). (Shun’ichi Amari is credited as the creator of the field of information geometry, which applies Riemannian geometry to probability; its main object of study is the statistical manifold, where each point corresponds to a probability distribution.) In the neural networks literature, the famous Group Invariance Theorem for Perceptrons by Minsky and Papert (2017) puts fundamental limitations on the capabilities of (single-layer) perceptrons to learn invariants. This was one of the primary motivations for studying multi-layer architectures (Sejnowski et al., 1986; Shawe-Taylor, 1989, 1993), which ultimately led to deep learning.

In the neural network community, the Neocognitron (Fukushima and Miyake, 1982) is credited as the first implementation of shift invariance in a neural network, for “pattern recognition unaffected by shift in position”. Fukushima’s solution came in the form of a hierarchical neural network with local connectivity, drawing inspiration from the receptive fields discovered in the visual cortex by the neuroscientists David Hubel and Torsten Wiesel two decades earlier (Hubel and Wiesel, 1959) — a classical work recognised by the Nobel Prize in Medicine in 1981, which Hubel and Wiesel shared with Roger Sperry. These ideas culminated in Convolutional Neural Networks in the seminal work of Yann LeCun and co-authors (LeCun et al., 1998). The first work to take a representation-theoretical view on invariant and equivariant neural networks was that of Wood and Shawe-Taylor (1996), unfortunately rarely cited. More recent incarnations of these ideas include the works of Makadia et al. (2007); Esteves et al. (2020) and one of the authors of this text (Cohen and Welling, 2016).

Graph Neural Networks

It is difficult to pinpoint exactly when the concept of Graph Neural Networks began to emerge—partly due to the fact that most of the early work did not place graphs as a first-class citizen, partly since GNNs became practical only in the late 2010s, and partly because this field emerged from the confluence of several research areas. That being said, early forms of graph neural networks can be traced back at least to the 1990s, with examples including Alessandro Sperduti’s Labeling RAAM (Sperduti, 1994), the “backpropagation through structure” of Goller and Kuchler (1996), and adaptive processing of data structures (Sperduti and Starita, 1997; Frasconi et al., 1998). While these works were primarily concerned with operating over “structures” (often trees or directed acyclic graphs), many of the invariances preserved in their architectures are reminiscent of the GNNs more commonly in use today.

The first proper treatment of the processing of generic graph structures (and the coining of the term “graph neural network”) happened after the turn of the 21st century. Within the Artificial Intelligence lab at the Università degli Studi di Siena (Italy), papers led by Marco Gori and Franco Scarselli proposed the first “GNN” (Gori et al., 2005; Scarselli et al., 2008). (Concurrently, Alessio Micheli proposed the neural network for graphs (NN4G) model, which focused on a feedforward rather than recurrent paradigm (Micheli, 2009).) These models relied on recurrent mechanisms and required the neural network parameters to specify contraction mappings, thus computing node representations by searching for a fixed point—this in itself necessitated a special form of backpropagation (Almeida, 1990; Pineda, 1988) and did not depend on node features at all. All of the above issues were rectified by the Gated GNN (GGNN) model of Li et al. (2015). GGNNs brought many benefits of modern RNNs, such as gating mechanisms (Cho et al., 2014) and backpropagation through time, to the GNN model, and remain popular today.

Computational chemistry

It is also very important to note an independent and concurrent line of development for GNNs: one that was entirely driven by the needs of computational chemistry, where molecules are most naturally expressed as graphs of atoms (nodes) connected by chemical bonds (edges). This invited computational techniques for molecular property prediction that operate directly over such a graph structure, which appeared in machine learning as early as the 1990s: examples include the ChemNet model of Kireev (1995) and the work of Baskin et al. (1997). Strikingly, the “molecular graph networks” of Merkwirth and Lengauer (2005) explicitly proposed many of the elements commonly found in contemporary GNNs—such as edge type-conditioned weights or global pooling—as early as 2005. The chemical motivation continued to drive GNN development into the 2010s, with two significant GNN advancements centered around improving molecular fingerprinting (Duvenaud et al., 2015) and predicting quantum-chemical properties (Gilmer et al., 2017) of small molecules. At the time of writing this text, molecular property prediction is one of the most successful applications of GNNs, with impactful results in the virtual screening of new antibiotic drugs (Stokes et al., 2020).

Node embeddings

Some of the earliest success stories of deep learning on graphs involve learning representations of nodes in an unsupervised fashion, based on the graph structure. Given their structural inspiration, this direction also provides one of the most direct links between graph representation learning and network science communities. The key early approaches in this space relied on random walk-based embeddings: learning node representations in a way that brings them closer together if the nodes co-occur in a short random walk. Representative methods in this space include DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016) and LINE (Tang et al., 2015), which are all purely self-supervised. Planetoid (Yang et al., 2016) was the first in this space to incorporate supervision label information, when it is available.
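The core mechanics of such random walk-based objectives can be sketched in a few lines. The snippet below is a minimal illustration only: the function names and the toy graph are ours, and methods such as DeepWalk additionally train a skip-gram model over the resulting pairs. It samples short walks and extracts the (centre, context) node pairs whose embeddings would be pulled together:

```python
import random

def random_walks(adj, walk_length=5, walks_per_node=10, seed=0):
    """Sample short uniform random walks from every node."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_length - 1):
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def skipgram_pairs(walks, window=2):
    """(centre, context) pairs of nodes co-occurring within a window."""
    pairs = []
    for walk in walks:
        for i, centre in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if i != j:
                    pairs.append((centre, walk[j]))
    return pairs

# A toy 4-cycle graph (hypothetical example): node -> neighbour list.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
pairs = skipgram_pairs(random_walks(adj))
```

An embedding objective would then push the vectors of each pair in `pairs` together, while pushing randomly sampled (negative) pairs apart.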

Unifying random walk objectives with GNN encoders was attempted on several occasions, with representative approaches including the Variational Graph Autoencoder (VGAE, Kipf and Welling (2016b)), embedding propagation (García-Durán and Niepert, 2017), and unsupervised variants of GraphSAGE (Hamilton et al., 2017). (Recently, a theoretical framework was developed by Srinivasan and Ribeiro (2019) in which the equivalence of structural and positional representations was demonstrated. Additionally, Qiu et al. (2018) have demonstrated that all random walk-based embedding techniques are equivalent to an appropriately-posed matrix factorisation task.) However, this was met with mixed results, and it was shortly discovered that pushing neighbouring node representations together is already a key part of GNNs’ inductive bias. Indeed, it was shown that an untrained GNN already achieves performance competitive with DeepWalk in settings where node features are available (Veličković et al., 2019; Wu et al., 2019). This launched a direction that moved away from combining random walk objectives with GNNs, shifting towards contrastive approaches inspired by mutual information maximisation and aligning with successful methods in the image domain. Prominent examples of this direction include Deep Graph Infomax (DGI, Veličković et al. (2019)), GRACE (Zhu et al., 2020), BERT-like objectives (Hu et al., 2020) and BGRL (Thakoor et al., 2021).

Probabilistic graphical models

Graph neural networks have also, concurrently, resurged through embedding the computations of probabilistic graphical models (PGMs, Wainwright and Jordan (2008)). PGMs are a powerful tool for processing graphical data, and their utility arises from their probabilistic perspective on the graph’s edges: namely, the nodes are treated as random variables, while the graph structure encodes conditional independence assumptions, significantly simplifying the calculation of, and sampling from, the joint distribution. Indeed, many algorithms for (exactly or approximately) supporting learning and inference on PGMs rely on forms of passing messages over their edges (Pearl, 2014), with examples including variational mean-field inference and loopy belief propagation (Yedidia et al., 2001; Murphy et al., 2013).

This connection between PGMs and message passing was subsequently developed into GNN architectures, with early theoretical links established by the authors of structure2vec (Dai et al., 2016). Namely, by posing a graph representation learning setting as a Markov random field (of nodes corresponding to input features and latent representations), the authors directly align the computation of both mean-field inference and loopy belief propagation to a model not unlike the GNNs commonly in use today.

The key “trick” which allowed for relating the latent representations of a GNN to probability distributions maintained by a PGM was the usage of Hilbert-space embeddings of distributions (Smola et al., 2007). Given $\phi$, an appropriately chosen embedding function for features $\mathbf{x}$, it is possible to embed their probability distribution $p(\mathbf{x})$ as the expected embedding $\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\phi(\mathbf{x})$. Such a correspondence allows us to perform GNN-like computations, knowing that the representations computed by the GNN will always correspond to an embedding of some probability distribution over the node features.
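As a small numerical illustration of the idea (not the structure2vec model itself), the expected embedding can be estimated by Monte Carlo. With the hypothetical feature map $\phi(x) = (x, x^2)$, the mean embedding of a distribution stores its first two moments:

```python
import random

def mean_embedding(samples, phi):
    """Monte-Carlo estimate of the expected embedding E_{x~p}[phi(x)]."""
    embedded = [phi(x) for x in samples]
    dim = len(embedded[0])
    return [sum(e[i] for e in embedded) / len(embedded) for i in range(dim)]

# Hypothetical feature map: first two moments of a scalar feature.
phi = lambda x: [x, x * x]

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
mu = mean_embedding(samples, phi)
# For a standard normal, mu approximates [E[x], E[x^2]] = [0, 1].
```

For a sufficiently rich (characteristic) $\phi$, the mean embedding identifies the distribution uniquely, which is what justifies operating on embeddings in place of distributions.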

The structure2vec model itself is, ultimately, a GNN architecture which easily sits within our framework, but its setup has inspired a series of GNN architectures which more directly incorporate computations found in PGMs. Emerging examples have successfully combined GNNs with conditional random fields (Gao et al., 2019; Spalević et al., 2020), relational Markov networks (Qu et al., 2019) and Markov logic networks (Zhang et al., 2020).

The Weisfeiler-Lehman formalism

The resurgence of graph neural networks was followed closely by a drive to understand their fundamental limitations, especially in terms of expressive power. While it was becoming evident that GNNs are a strong tool for modelling graph-structured data, it was also clear that they would not be able to solve any task specified on a graph perfectly. A canonical illustrative example of this is deciding graph isomorphism: is our GNN able to attach different representations to two given non-isomorphic graphs? (Due to their permutation invariance, GNNs will attach identical representations to two isomorphic graphs, so that case is trivially solved.) This is a useful framework for two reasons. If the GNN is unable to do this, then it will be hopeless on any task requiring the discrimination of these two graphs. Further, it is currently not known whether deciding graph isomorphism is in P, the complexity class in which all GNN computations typically reside. (The best currently known algorithm for deciding graph isomorphism is due to Babai and Luks (1983), though a recent, not fully reviewed, proposal by Babai (2016) implies a quasi-polynomial time solution.)

The key framework which binds GNNs to graph isomorphism is the Weisfeiler-Lehman (WL) graph isomorphism test (Weisfeiler and Leman, 1968). This test generates a graph representation by iteratively passing node features along the edges of the graph, then randomly hashing their sums across neighbourhoods. Connections to randomly-initialised convolutional GNNs are apparent, and have been observed early on: for example, within the GCN model of Kipf and Welling (2016a). Aside from this connection, the WL iteration was previously introduced in the domain of graph kernels by Shervashidze et al. (2011), and it still presents a strong baseline for unsupervised learning of whole-graph representations.

While the WL test is conceptually simple, and there are many simple examples of non-isomorphic graphs it cannot distinguish (one such pair: a 6-cycle and two disjoint triangles), its expressive power is ultimately strongly tied to that of GNNs. Analyses by Morris et al. (2019) and Xu et al. (2018) have both reached a striking conclusion: any GNN conforming to one of the three flavours we outlined in Section 5.3 cannot be more powerful than the WL test!

In order to exactly reach this level of representational power, certain constraints must be placed on the GNN update rule. Xu et al. (2018) have shown that, in the discrete-feature domain, the aggregation function the GNN uses must be injective, with summation being a key representative. (Popular aggregators such as maximisation and averaging fall short in this regard, because they cannot distinguish e.g. the neighbour multisets {{a, b}} and {{a, a, b, b}}.) Based on the outcome of their analysis, Xu et al. (2018) propose the Graph Isomorphism Network (GIN), a simple but powerful example of a maximally-expressive GNN under this framework. It is also expressible under the convolutional GNN flavour we propose.
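The multiset example can be checked directly. The following toy snippet (the scalar feature values a = 1.0 and b = 2.0 are our own illustrative choice) shows that summation separates the two neighbour multisets while mean and max collapse them:

```python
# Toy comparison of aggregators over the neighbour multisets {{a, b}} and
# {{a, a, b, b}}; the scalar values a = 1.0, b = 2.0 are illustrative.
a, b = 1.0, 2.0
m1 = [a, b]          # {{a, b}}
m2 = [a, a, b, b]    # {{a, a, b, b}}

agg_sum = (sum(m1), sum(m2))                       # (3.0, 6.0) -- distinguished
agg_mean = (sum(m1) / len(m1), sum(m2) / len(m2))  # (1.5, 1.5) -- collapsed
agg_max = (max(m1), max(m2))                       # (2.0, 2.0) -- collapsed

print(agg_sum, agg_mean, agg_max)
```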

Lastly, it is worth noting that these findings do not generalise to continuous node feature spaces. In fact, using the Borsuk-Ulam theorem (Borsuk, 1933), Corso et al. (2020) have demonstrated that, assuming real-valued node features, obtaining injective aggregation functions requires multiple aggregators (specifically, as many as the degree of the receiver node; one example of such an aggregator family are the moments of the multiset of neighbours). Their findings have driven the Principal Neighbourhood Aggregation (PNA) architecture, a multiple-aggregator GNN that is empirically powerful and stable.
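As a small illustration of moment-based aggregation (the numeric values below are our own, purely illustrative), two real-valued multisets can agree on both mean and max yet be separated by their higher moments:

```python
# Moments of a multiset of scalars as an aggregator family: the first k
# moments of a size-k multiset determine it uniquely. The two multisets
# below (illustrative values) share both mean and max, yet their second
# moments already differ (3.0 vs. 11/3).
def moments(multiset, k):
    n = len(multiset)
    return tuple(sum(x ** p for x in multiset) / n for p in range(1, k + 1))

m1 = [0.0, 0.0, 3.0]
m2 = [-1.0, 1.0, 3.0]   # same mean (1.0) and same max (3.0) as m1

assert sum(m1) / 3 == sum(m2) / 3 and max(m1) == max(m2)
print(moments(m1, 2))  # (1.0, 3.0)
print(moments(m2, 2))  # second moment 11/3 separates the two multisets
```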

Higher-order methods

The findings of the previous paragraphs do not contradict the practical utility of GNNs. Indeed, in many real-world applications the input features are sufficiently rich to support useful discriminative computations over the graph structure, despite the above limitations (which, in contrast, almost always consider featureless or categorically-featured graphs).

However, one key corollary is that GNNs are relatively weak at detecting some rudimentary structures within a graph. Guided by specific limitations or failure cases of the WL test, several works have provided stronger variants of GNNs that are provably more powerful than the WL test, and hence likely to be useful on tasks that require such structural detection. (One prominent example is computational chemistry, wherein a molecule's chemical function can be strongly influenced by the presence of aromatic rings in its molecular graph.)

Perhaps the most direct place to hunt for more expressive GNNs is the WL test itself. Indeed, the strength of the original WL test can be enhanced by considering a hierarchy of WL tests, such that the k-WL test attaches representations to k-tuples of nodes (Morris et al., 2017). The k-WL test has been directly translated into a higher-order k-GNN architecture by Morris et al. (2019), which is provably more powerful than the GNN flavours we considered before. (There have been efforts, such as the δ-k-LGNN (Morris et al., 2020), to sparsify the computation of the k-GNN.) However, its requirement to maintain tuple representations implies that, in practice, it is hard to scale beyond k = 3.

Concurrently, Maron et al. (2018, 2019) have studied the characterisation of invariant and equivariant graph networks over k-tuples of nodes. Besides demonstrating the surprising result that any invariant or equivariant graph network is expressible as a linear combination of a finite number of generators (the number of which depends only on k), the authors showed that the expressive power of such layers is equivalent to the k-WL test, and proposed an empirically scalable variant which is provably 3-WL powerful.

Besides generalising the domain over which representations are computed, significant effort has also gone into analysing specific failure cases of 1-WL and augmenting GNN inputs to help them distinguish such cases. One common example is attaching identifying features to the nodes, which can help detect structure (for example, if a node sees its own identifier k hops away, that is a direct indicator that it lies within a k-cycle). Proposals to do this include one-hot representations (Murphy et al., 2019), as well as purely random features (Sato et al., 2020).

More broadly, there have been many efforts to incorporate structural information within the message passing process, either by modulating the message function or by modifying the graph over which the computations are carried out. (In the computational chemistry domain, it is often assumed that molecular function is driven by substructures, the functional groups, which has directly inspired modelling molecules at a motif level; see Jin et al. (2018, 2020); Fey et al. (2020).) Several interesting lines of work here involve sampling anchor node sets (You et al., 2019), aggregating based on Laplacian eigenvectors (Stachenfeld et al., 2020; Beaini et al., 2020; Dwivedi and Bresson, 2020), or performing topological data analysis, either for positional embeddings (Bouritsas et al., 2020) or for driving message passing (Bodnar et al., 2021).

Signal Processing and Harmonic Analysis

Since the early successes of Convolutional Neural Networks, researchers have resorted to tools from harmonic analysis, image processing, and computational neuroscience in trying to provide a theoretical framework that explains their efficiency. M-theory is a framework inspired by the visual cortex, pioneered by Tomaso Poggio and collaborators (Riesenhuber and Poggio, 1999; Serre et al., 2007), based on the notion of templates that can be manipulated under certain symmetry groups. Another notable model arising from computational neuroscience was the steerable pyramid, a form of multiscale wavelet decomposition with favourable properties under certain input transformations, developed by Simoncelli and Freeman (1995). Steerable pyramids were a central element in early generative models for textures (Portilla and Simoncelli, 2000), which were subsequently improved by replacing steerable wavelet features with deep CNN features (Gatys et al., 2015). Finally, scattering transforms, introduced by Stéphane Mallat (2012) and developed by Bruna and Mallat (2013), provided a framework for understanding CNNs by replacing trainable filters with multiscale wavelet decompositions, also showcasing deformation stability and the role of depth in the architecture.

Signal Processing on Graphs and Meshes

Another important class of graph neural networks, often referred to as spectral, emerged from the work of one of the authors of this text (Bruna et al., 2013), using the notion of the Graph Fourier transform. The roots of this construction are in the signal processing and computational harmonic analysis communities, where dealing with non-Euclidean signals became prominent in the late 2000s and early 2010s. Influential papers from the groups of Pierre Vandergheynst (Shuman et al., 2013) and José Moura (Sandryhaila and Moura, 2013) popularised the notion of “Graph Signal Processing” (GSP) and the generalisation of Fourier transforms based on the eigenvectors of graph adjacency and Laplacian matrices. The graph convolutional neural networks relying on spectral filters by Defferrard et al. (2016) and Kipf and Welling (2016a) are among the most cited in the field and can likely be credited as the works that reignited interest in machine learning on graphs in recent years.

It is worth noting that, in the field of computer graphics and geometry processing, non-Euclidean harmonic analysis predates Graph Signal Processing by at least a decade. We can trace spectral filters on manifolds and meshes to the work of Taubin et al. (1996). These methods became mainstream in the 2000s following the influential papers of Karni and Gotsman (2000) on spectral geometry compression and of Lévy (2006) on using the Laplacian eigenvectors as a non-Euclidean Fourier basis. Spectral methods have been used for a range of applications, most prominent of which are the construction of shape descriptors (Sun et al., 2009) and functional maps (Ovsjanikov et al., 2012); these methods are still broadly used in computer graphics at the time of writing. (Learnable shape descriptors similar to spectral graph CNNs were proposed by Roee Litman and Alex Bronstein (2013), the latter being a twin brother of an author of this text.)

Computer Graphics and Geometry Processing

Models for shape analysis based on intrinsic metric invariants were introduced by various authors in the field of computer graphics and geometry processing (Elad and Kimmel, 2003; Mémoli and Sapiro, 2005; Bronstein et al., 2006), and are discussed in depth by one of the authors in his earlier book (Bronstein et al., 2008). Notions of intrinsic symmetry were also explored in the same field (Raviv et al., 2007; Ovsjanikov et al., 2008). The first architecture for deep learning on meshes, Geodesic CNNs, was developed in the team of one of the authors of this text (Masci et al., 2015). This model used local filters with shared weights, applied to geodesic radial patches; it is a particular instance of the gauge-equivariant CNNs developed later by another author of the text (Cohen et al., 2019). A generalisation of Geodesic CNNs with learnable aggregation operations, MoNet, proposed by Federico Monti et al. (2017) from the same team, used an attention-like mechanism over the local structural features of the mesh that was demonstrated to work on general graphs as well. The graph attention network (GAT), which technically speaking can be considered a particular instance of MoNet, was introduced by another author of this text (Veličković et al., 2018). GATs generalise MoNet's attention mechanism to also incorporate node feature information, breaking away from the purely structure-derived relevance of prior work. It is one of the most popular GNN architectures currently in use.

In the context of computer graphics, it is also worthwhile to mention that the idea of learning on sets (Zaheer et al., 2017) was concurrently developed in the group of Leo Guibas at Stanford under the name PointNet (Qi et al., 2017) for the analysis of 3D point clouds. This architecture has led to multiple follow-up works, including one by an author of this text called Dynamic Graph CNN (DGCNN; Wang et al., 2019b). DGCNN used a nearest-neighbour graph to capture the local structure of the point cloud and allow exchange of information across the nodes; the key characteristic of this architecture was that the graph was constructed on-the-fly and updated between the layers of the neural network in relation to the downstream task. This latter property made DGCNN one of the first incarnations of 'latent graph learning', which in turn has seen significant follow-up work. Extensions to DGCNN's k-nearest-neighbour graph proposal include more explicit control over these graphs' edges, either through bilevel optimisation (Franceschi et al., 2019), reinforcement learning (Kazi et al., 2020) or direct supervision (Veličković et al., 2020). Independently, a variational direction (which probabilistically samples edges from a computed posterior distribution) has emerged through the NRI model (Kipf et al., 2018). While it still relies on computation quadratic in the number of nodes, it allows for explicitly encoding uncertainty about the chosen edges.

Another very popular direction in learning on graphs without a provided graph relies on performing GNN-style computations over a complete graph, letting the network infer its own way to exploit the connectivity. The need for this arose particularly in natural language processing, where the words in a sentence interact in highly nontrivial and non-sequential ways. Operating over a complete graph of words brought about the first incarnation of the Transformer model (Vaswani et al., 2017), which dethroned both recurrent and convolutional models as the state of the art in neural machine translation, and kicked off an avalanche of related work, transcending the boundaries between NLP and other fields. Fully-connected GNN computation has also concurrently emerged in simulation (Battaglia et al., 2016), reasoning (Santoro et al., 2017), and multi-agent (Hoshen, 2017) applications, and still represents a popular choice when the number of nodes is reasonably small.

Algorithmic reasoning

For most of the discussion in this section, we have given examples of spatially induced geometries, which in turn shape the underlying domain and its invariances and symmetries. However, plentiful examples of invariances and symmetries also arise in a computational setting. One critical difference from many common settings of Geometric Deep Learning is that links no longer need to encode any kind of similarity, proximity, or type of relation: they merely specify the “recipe” for the dataflow between the data points they connect.

Instead, the computations of the neural network mimic the reasoning process of an algorithm (Cormen et al., 2009), with additional invariances induced by the algorithm's control flow and intermediate results. (For example, one invariant of the Bellman-Ford pathfinding algorithm (Bellman, 1958) is that, after k steps, it will always have computed the shortest paths to the source node that use no more than k edges.) In the space of algorithms, the assumed input invariants are often referred to as preconditions, while the invariants preserved by the algorithm are known as postconditions.
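The Bellman-Ford invariant can be observed directly in code. The following sketch (the directed graph and its weights are illustrative) runs a fixed number of synchronous relaxation rounds:

```python
# Synchronous Bellman-Ford relaxation: after k rounds, dist[v] is the
# length of the shortest path from the source to v using at most k edges.
# The directed graph below (edge list of (u, v, weight)) is illustrative.
INF = float("inf")

def bellman_ford_steps(edges, n, source, k):
    dist = [INF] * n
    dist[source] = 0.0
    for _ in range(k):
        new = list(dist)  # update synchronously to preserve the invariant
        for u, v, w in edges:
            if dist[u] + w < new[v]:
                new[v] = dist[u] + w
        dist = new
    return dist

edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 5.0)]
print(bellman_ford_steps(edges, 4, 0, 1))  # [0.0, 1.0, inf, 5.0], paths of <= 1 edge
print(bellman_ford_steps(edges, 4, 0, 3))  # [0.0, 1.0, 2.0, 3.0], paths of <= 3 edges
```

After one round, node 3 is reachable only via the direct weight-5 edge; after three rounds, the cheaper three-edge path of total weight 3 has been discovered, exactly as the invariant predicts.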

The research direction of algorithmic reasoning (Cappart et al., 2021, Section 3.3) seeks to produce neural network architectures that appropriately preserve algorithmic invariants. The area has investigated the construction of general-purpose neural computers, e.g., the neural Turing machine (Graves et al., 2014) and the differentiable neural computer (Graves et al., 2016). While such architectures have all the hallmarks of general computation, they introduced several components at once, which often made them challenging to optimise; in practice, they are almost always outperformed by simple relational reasoners, such as the ones proposed by Santoro et al. (2017, 2018).

As modelling complex postconditions is challenging, plentiful work on inductive biases for learning to execute (Zaremba and Sutskever, 2014) has focused on primitive algorithms (e.g. simple arithmetic). Prominent examples in this space include the neural GPU (Kaiser and Sutskever, 2015), neural RAM (Kurach et al., 2015), neural programmer-interpreters (Reed and De Freitas, 2015), neural arithmetic-logic units (Trask et al., 2018; Madsen and Johansen, 2020) and neural execution engines (Yan et al., 2020).

Emulating combinatorial algorithms of superlinear complexity was made possible with the rapid development of GNN architectures. The algorithmic alignment framework pioneered by Xu et al. (2019) demonstrated, theoretically, that GNNs align with dynamic programming (Bellman, 1966), a language in which most algorithms can be expressed. It was concurrently shown empirically, by one of the authors of this text, that it is possible to design and train GNNs that align with algorithmic invariants in practice (Veličković et al., 2019). Subsequently, alignment was achieved with iterative algorithms (Tang et al., 2020), linearithmic algorithms (Freivalds et al., 2019), data structures (Veličković et al., 2020) and persistent memory (Strathmann et al., 2021). Such models have also seen practical use in implicit planners (Deac et al., 2020), breaking into the space of reinforcement learning algorithms.

Concurrently, significant progress has been made on using GNNs for physics simulations (Sanchez-Gonzalez et al., 2020; Pfaff et al., 2020). This direction yielded much of the same recommendations for the design of generalising GNNs. Such a correspondence is to be expected: given that algorithms can be phrased as discrete-time simulations, and simulations are typically implemented as step-wise algorithms, both directions will need to preserve similar kinds of invariants.

Tightly bound with the study of algorithmic reasoning are measures of extrapolation. This is a notorious pain-point for neural networks, given that most of their success stories are obtained when generalising in-distribution, i.e. when the patterns found in the training data properly anticipate those found in the test data. However, algorithmic invariants must be preserved irrespective of, e.g., the size or generative distribution of the input, meaning that the training set will likely not cover every scenario encountered in practice. Xu et al. (2020b) have proposed a geometric argument for what is required of an extrapolating GNN backed by rectifier activations: its components and featurisation need to be designed so that its constituent modules (e.g. the message function) only have to learn linear target functions. Bevilacqua et al. (2021) propose viewing extrapolation through the lens of causal reasoning, yielding environment-invariant representations of graphs.

Geometric Deep Learning

Our final historical remarks regard the very name of this text. The term ‘Geometric Deep Learning’ was first introduced by one of the authors of this text in his ERC grant in 2015 and popularised in the eponymous IEEE Signal Processing Magazine paper (Bronstein et al., 2017). This paper proclaimed, albeit “with some caution”, the signs of “a new field being born.” Given the recent popularity of graph neural networks, the increasing use of ideas of invariance and equivariance in a broad range of machine learning applications, and the very fact of us writing this text, it is probably right to consider this prophecy at least partially fulfilled. The name “4G: Grids, Graphs, Groups, and Gauges” was coined by Max Welling for the ELLIS Program on Geometric Deep Learning, co-directed by two authors of the text. Admittedly, the last ‘G’ is somewhat of a stretch, since the underlying structures are manifolds and bundles rather than gauges. For this text, we added another ‘G’, Geodesics, in reference to metric invariants and intrinsic symmetries of manifolds.

Acknowledgements

This text represents a humble attempt to summarise and synthesise decades of existing knowledge in deep learning architectures, through the geometric lens of invariance and symmetry. We hope that our perspective will make it easier both for newcomers and practitioners to navigate the field, and for researchers to synthesise novel architectures, as instances of our blueprint. In a way, we hope to have presented “all you need to build the architectures that are all you need”—a play on words inspired by Vaswani et al. (2017).

The bulk of the text was written during late 2020 and early 2021. As often happens, we had thousands of doubts about whether the whole picture made sense, and used opportunities provided by our colleagues to help us break our “stage fright” and present early versions of our work, which saw the light of day in Petar's talk at Cambridge (courtesy of Pietro Liò) and Michael's talks at Oxford (courtesy of Xiaowen Dong) and Imperial College (hosted by Michael Huth and Daniel Rueckert). Petar was also able to present our work at Friedrich-Alexander-Universität Erlangen-Nürnberg (the birthplace of the Erlangen Programme!) owing to a kind invitation from Andreas Maier. The feedback we received for these talks was invaluable in keeping our spirits high, as well as in polishing the work further. Last, but certainly not least, we thank the organising committee of ICLR 2021, where our work will be featured in a keynote talk delivered by Michael.

We should note that reconciling such a vast quantity of research is seldom possible with the expertise of only four people. Accordingly, we would like to give due credit to all of the researchers who carefully studied aspects of our text as it evolved and provided us with thoughtful comments and references: Yoshua Bengio, Charles Blundell, Andreea Deac, Fabian Fuchs, Francesco di Giovanni, Marco Gori, Raia Hadsell, Will Hamilton, Maksym Korablyov, Christian Merkwirth, Razvan Pascanu, Bruno Ribeiro, Anna Scaife, Jürgen Schmidhuber, Marwin Segler, Corentin Tallec, Ngân Vũ, Peter Wirnsberger and David Wong. Their expert feedback was invaluable in solidifying our unification efforts and making them more useful to various niches. Of course, any irregularities within this text are our responsibility alone. The text is very much a work-in-progress, and we are happy to receive comments at any stage; please contact us if you spot any errors or omissions.

References

  • Aflalo and Kimmel (2013) Yonathan Aflalo and Ron Kimmel. Spectral multidimensional scaling. PNAS, 110(45):18052–18057, 2013.
  • Aflalo et al. (2015) Yonathan Aflalo, Haim Brezis, and Ron Kimmel. On the optimality of shape and data representation in the spectral domain. SIAM J. Imaging Sciences, 8(2):1141–1160, 2015.
  • Almeida (1990) Luis B Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Artificial neural networks: concept learning, pages 102–111. 1990.
  • Alon and Yahav (2020) Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. arXiv:2006.05205, 2020.
  • Amari (1978) S. Amari. Feature spaces which admit and detect invariant signal transformations. In Joint Conference on Pattern Recognition, 1978.
  • Anderson et al. (2019) Brandon Anderson, Truong-Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. arXiv:1906.04015, 2019.
  • Anderson (1972) Philip W Anderson. More is different. Science, 177(4047):393–396, 1972.
  • Andreux et al. (2014) Mathieu Andreux, Emanuele Rodola, Mathieu Aubry, and Daniel Cremers. Anisotropic Laplace-Beltrami operators for shape analysis. In ECCV, 2014.
  • Arslan et al. (2018) Salim Arslan, Sofia Ira Ktena, Ben Glocker, and Daniel Rueckert. Graph saliency maps through spectral convolutional networks: Application to sex classification with brain connectivity. In Graphs in Biomedical Image Analysis and Integrating Medical Imaging and Non-Imaging Modalities, pages 3–13. 2018.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016.
  • Babai (2016) László Babai. Graph isomorphism in quasipolynomial time. In ACM Symposium on Theory of Computing, 2016.
  • Babai and Luks (1983) László Babai and Eugene M Luks. Canonical labeling of graphs. In ACM Symposium on Theory of computing, 1983.
  • Bach (2017) Francis Bach. Breaking the curse of dimensionality with convex neural networks. JMLR, 18(1):629–681, 2017.
  • Badia et al. (2020) Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the atari human benchmark. In ICML, 2020.
  • Badrinarayanan et al. (2017) Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. Trans. PAMI, 39(12):2481–2495, 2017.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
  • Ball (2011) Philip Ball. In retrospect: On the six-cornered snowflake. Nature, 480(7378):455–455, 2011.
  • Bamieh (2018) Bassam Bamieh. Discovering transforms: A tutorial on circulant matrices, circular convolution, and the discrete fourier transform. arXiv:1805.05533, 2018.
  • Banach (1922) Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
  • Bapst et al. (2020) Victor Bapst, Thomas Keck, A Grabska-Barwińska, Craig Donner, Ekin Dogus Cubuk, Samuel S Schoenholz, Annette Obika, Alexander WR Nelson, Trevor Back, Demis Hassabis, et al. Unveiling the predictive power of static structure in glassy systems. Nature Physics, 16(4):448–454, 2020.
  • Barabási et al. (2011) Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011.
  • Barron (1993) Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Information Theory, 39(3):930–945, 1993.
  • Baskin et al. (1997) Igor I Baskin, Vladimir A Palyulin, and Nikolai S Zefirov. A neural device for searching direct correlations between structures and properties of chemical compounds. J. Chemical Information and Computer Sciences, 37(4):715–721, 1997.
  • Battaglia et al. (2016) Peter W Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. arXiv:1612.00222, 2016.
  • Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261, 2018.
  • Beaini et al. (2020) Dominique Beaini, Saro Passaro, Vincent Létourneau, William L Hamilton, Gabriele Corso, and Pietro Liò. Directional graph networks. arXiv:2010.02863, 2020.
  • Bellman (1958) Richard Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.
  • Bellman (1966) Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5(2):157–166, 1994.
  • Berger (2012) Marcel Berger. A panoramic view of Riemannian geometry. Springer, 2012.
  • Besson et al. (2020) Pierre Besson, Todd Parrish, Aggelos K Katsaggelos, and S Kathleen Bandt. Geometric deep learning on brain shape predicts sex and age. BioRxiv:177543, 2020.
  • Bevilacqua et al. (2021) Beatrice Bevilacqua, Yangze Zhou, and Bruno Ribeiro. Size-invariant graph representations for graph classification extrapolations. arXiv:2103.05045, 2021.
  • Blanc et al. (2020) Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In COLT, 2020.
  • Bodnar et al. (2021) Cristian Bodnar, Fabrizio Frasca, Yu Guang Wang, Nina Otter, Guido Montúfar, Pietro Liò, and Michael Bronstein. Weisfeiler and lehman go topological: Message passing simplicial networks. arXiv:2103.03212, 2021.
  • Bogatskiy et al. (2020) Alexander Bogatskiy, Brandon Anderson, Jan Offermann, Marwah Roussi, David Miller, and Risi Kondor. Lorentz group equivariant neural network for particle physics. In ICML, 2020.
  • Borsuk (1933) Karol Borsuk. Drei sätze über die n-dimensionale euklidische sphäre. Fundamenta Mathematicae, 20(1):177–190, 1933.
  • Boscaini et al. (2015) Davide Boscaini, Davide Eynard, Drosos Kourounis, and Michael M Bronstein. Shape-from-operator: Recovering shapes from intrinsic operators. Computer Graphics Forum, 34(2):265–274, 2015.
  • Boscaini et al. (2016a) Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In NIPS, 2016a.
  • Boscaini et al. (2016b) Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Michael M Bronstein, and Daniel Cremers. Anisotropic diffusion descriptors. Computer Graphics Forum, 35(2):431–441, 2016b.
  • Bougleux et al. (2015) Sébastien Bougleux, Luc Brun, Vincenzo Carletti, Pasquale Foggia, Benoit Gaüzere, and Mario Vento. A quadratic assignment formulation of the graph edit distance. arXiv:1512.07494, 2015.
  • Bouritsas et al. (2020) Giorgos Bouritsas, Fabrizio Frasca, Stefanos Zafeiriou, and Michael M Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. arXiv:2006.09252, 2020.
  • Bronstein et al. (2006) Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. PNAS, 103(5):1168–1172, 2006.
  • Bronstein et al. (2008) Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical geometry of non-rigid shapes. Springer, 2008.
  • Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv:2005.14165, 2020.
  • Bruna and Mallat (2013) Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. Trans. PAMI, 35(8):1872–1886, 2013.
  • Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2013.
  • Cappart et al. (2021) Quentin Cappart, Didier Chételat, Elias Khalil, Andrea Lodi, Christopher Morris, and Petar Veličković. Combinatorial optimization and reasoning with graph neural networks. arXiv:2102.09544, 2021.
  • Chen et al. (2018) Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv:1806.07366, 2018.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  • Chern et al. (2018) Albert Chern, Felix Knöppel, Ulrich Pinkall, and Peter Schröder. Shape from metric. ACM Trans. Graphics, 37(4):1–17, 2018.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
  • Choma et al. (2018) Nicholas Choma, Federico Monti, Lisa Gerhardt, Tomasz Palczewski, Zahra Ronaghi, Prabhat Prabhat, Wahid Bhimji, Michael M Bronstein, Spencer R Klein, and Joan Bruna. Graph neural networks for icecube signal classification. In ICMLA, 2018.
  • Cohen and Welling (2016) Taco Cohen and Max Welling. Group equivariant convolutional networks. In ICML, 2016.
  • Cohen et al. (2019) Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. In ICML, 2019.
  • Cohen et al. (2018) Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. arXiv:1801.10130, 2018.
  • Cooijmans et al. (2016) Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. Recurrent batch normalization. arXiv:1603.09025, 2016.
  • Corman et al. (2017) Etienne Corman, Justin Solomon, Mirela Ben-Chen, Leonidas Guibas, and Maks Ovsjanikov. Functional characterization of intrinsic and extrinsic geometry. ACM Trans. Graphics, 36(2):1–17, 2017.
  • Cormen et al. (2009) Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
  • Corso et al. (2020) Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbourhood aggregation for graph nets. arXiv:2004.05718, 2020.
  • Cosmo et al. (2020) Luca Cosmo, Anees Kazi, Seyed-Ahmad Ahmadi, Nassir Navab, and Michael Bronstein. Latent-graph learning for disease prediction. In MICCAI, 2020.
  • Cranmer et al. (2020) Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. arXiv:2003.04630, 2020.
  • Cranmer et al. (2019) Miles D Cranmer, Rui Xu, Peter Battaglia, and Shirley Ho. Learning symbolic physics with graph networks. arXiv:1909.05862, 2019.
  • Cucurull et al. (2018) Guillem Cucurull, Konrad Wagstyl, Arantxa Casanova, Petar Veličković, Estrid Jakobsen, Michal Drozdzal, Adriana Romero, Alan Evans, and Yoshua Bengio. Convolutional neural networks for mesh-based parcellation of the cerebral cortex. 2018.
  • Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
  • Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
  • De Fauw et al. (2018) Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24(9):1342–1350, 2018.
  • de Haan et al. (2020) Pim de Haan, Maurice Weiler, Taco Cohen, and Max Welling. Gauge equivariant mesh CNNs: Anisotropic convolutions on geometric graphs. In NeurIPS, 2020.
  • Deac et al. (2019) Andreea Deac, Petar Veličković, and Pietro Sormanni. Attentive cross-modal paratope prediction. Journal of Computational Biology, 26(6):536–545, 2019.
  • Deac et al. (2020) Andreea Deac, Petar Veličković, Ognjen Milinković, Pierre-Luc Bacon, Jian Tang, and Mladen Nikolić. XLVIN: executed latent value iteration nets. arXiv:2010.13146, 2020.
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
  • Derrow-Pinion et al. (2021) Austin Derrow-Pinion, Jennifer She, David Wong, Oliver Lange, Todd Hester, Luis Perez, Marc Nunkesser, Seongjae Lee, Xueying Guo, Peter W Battaglia, Vishal Gupta, Ang Li, Zhongwen Xu, Alvaro Sanchez-Gonzalez, Yujia Li, and Petar Veličković. Traffic prediction with graph neural networks in Google Maps. 2021.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
  • Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
  • Dwivedi and Bresson (2020) Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv:2012.09699, 2020.
  • Elad and Kimmel (2003) Asi Elad and Ron Kimmel. On bending invariant signatures for surfaces. Trans. PAMI, 25(10):1285–1295, 2003.
  • Elman (1990) Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
  • Esteves et al. (2020) Carlos Esteves, Ameesh Makadia, and Kostas Daniilidis. Spin-weighted spherical CNNs. arXiv:2006.10731, 2020.
  • Fang et al. (2020) Xiaomin Fang, Jizhou Huang, Fan Wang, Lingke Zeng, Haijin Liang, and Haifeng Wang. ConSTGAT: Contextual spatial-temporal graph attention network for travel time estimation at baidu maps. In KDD, 2020.
  • Fey et al. (2020) Matthias Fey, Jan-Gin Yuen, and Frank Weichert. Hierarchical inter-message passing for learning on molecular graphs. arXiv:2006.12179, 2020.
  • Finzi et al. (2020) Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. In ICML, 2020.
  • Folkman (1967) Jon Folkman. Regular line-symmetric graphs. Journal of Combinatorial Theory, 3(3):215–232, 1967.
  • Franceschi et al. (2019) Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In ICML, 2019.
  • Frasconi et al. (1998) Paolo Frasconi, Marco Gori, and Alessandro Sperduti. A general framework for adaptive processing of data structures. IEEE Trans. Neural Networks, 9(5):768–786, 1998.
  • Freivalds et al. (2019) Kārlis Freivalds, Emīls Ozoliņš, and Agris Šostaks. Neural shuffle-exchange networks – sequence processing in O(n log n) time. arXiv:1907.07897, 2019.
  • Fuchs et al. (2020) Fabian B Fuchs, Daniel E Worrall, Volker Fischer, and Max Welling. SE(3)-transformers: 3D roto-translation equivariant attention networks. arXiv:2006.10503, 2020.
  • Fukushima and Miyake (1982) Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer, 1982.
  • Gainza et al. (2020) Pablo Gainza, Freyr Sverrisson, Federico Monti, Emanuele Rodola, Davide Boscaini, Michael M Bronstein, and Bruno E Correia. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods, 17(2):184–192, 2020.
  • Gama et al. (2019) Fernando Gama, Alejandro Ribeiro, and Joan Bruna. Diffusion scattering transforms on graphs. In ICLR, 2019.
  • Gama et al. (2020) Fernando Gama, Joan Bruna, and Alejandro Ribeiro. Stability properties of graph neural networks. IEEE Trans. Signal Processing, 68:5680–5695, 2020.
  • Gao et al. (2019) Hongchang Gao, Jian Pei, and Heng Huang. Conditional random field enhanced graph convolutional neural networks. In KDD, 2019.
  • García-Durán and Niepert (2017) Alberto García-Durán and Mathias Niepert. Learning graph representations with embedding propagation. arXiv:1710.03059, 2017.
  • Gatys et al. (2015) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. arXiv:1505.07376, 2015.
  • Gaudelet et al. (2020) Thomas Gaudelet, Ben Day, Arian R Jamasb, Jyothish Soman, Cristian Regep, Gertrude Liu, Jeremy BR Hayter, Richard Vickers, Charles Roberts, Jian Tang, et al. Utilising graph machine learning within drug discovery and development. arXiv:2012.05716, 2020.
  • Gers and Schmidhuber (2000) Felix A Gers and Jürgen Schmidhuber. Recurrent nets that time and count. In IJCNN, 2000.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv:1704.01212, 2017.
  • Girshick (2015) Ross Girshick. Fast R-CNN. In ICCV, 2015.
  • Girshick et al. (2014) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • Gligorijevic et al. (2020) Vladimir Gligorijevic, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C Taylor, Ian M Fisk, Hera Vlamakis, et al. Structure-based function prediction using graph convolutional networks. bioRxiv:786236, 2020.
  • Goller and Kuchler (1996) Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In ICNN, 1996.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv:1406.2661, 2014.
  • Gori et al. (2005) Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In IJCNN, 2005.
  • Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
  • Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.
  • Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv:2006.07733, 2020.
  • Gromov (1981) Mikhael Gromov. Structures métriques pour les variétés riemanniennes. Cedic, 1981.
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
  • Gunasekar et al. (2017) Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In NIPS, 2017.
  • Gysi et al. (2020) Deisy Morselli Gysi, Ítalo Do Valle, Marinka Zitnik, Asher Ameli, Xiao Gan, Onur Varol, Helia Sanchez, Rebecca Marlene Baron, Dina Ghiassian, Joseph Loscalzo, et al. Network medicine framework for identifying drug repurposing opportunities for COVID-19. arXiv:2004.07229, 2020.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
  • Hao et al. (2020) Junheng Hao, Tong Zhao, Jin Li, Xin Luna Dong, Christos Faloutsos, Yizhou Sun, and Wei Wang. P-companion: A principled framework for diversified complementary product recommendation. In CIKM, 2020.
  • Hardt and Ma (2016) Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv:1611.04231, 2016.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • Helvétius (1759) Claude Adrien Helvétius. De l’esprit. Durand, 1759.
  • Hjelm et al. (2019) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
  • Hochreiter (1991) Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, Technische Universität München, 1991.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hornik (1991) Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
  • Hoshen (2017) Yedid Hoshen. VAIN: Attentional multi-agent predictive modeling. arXiv:1706.06122, 2017.
  • Hu et al. (2020) Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In ICLR, 2020.
  • Hubel and Wiesel (1959) David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat’s striate cortex. J. Physiology, 148(3):574–591, 1959.
  • Hutchinson et al. (2020) Michael Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. LieTransformer: Equivariant self-attention for Lie groups. arXiv:2012.10885, 2020.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Iqbal (2018) Haris Iqbal. Harisiqbal88/plotneuralnet v1.0.0, December 2018. URL https://doi.org/10.5281/zenodo.2526396.
  • Itani and Thanou (2021) Sarah Itani and Dorina Thanou. Combining anatomical and functional networks for neuropathology identification: A case study on autism spectrum disorder. Medical Image Analysis, 69:101986, 2021.
  • Jin et al. (2018) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In ICML, 2018.
  • Jin et al. (2020) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Hierarchical generation of molecular graphs using structural motifs. In ICML, 2020.
  • Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
  • Jordan (1997) Michael I Jordan. Serial order: A parallel distributed processing approach. In Advances in Psychology, volume 121, pages 471–495. 1997.
  • Joshi (2020) Chaitanya Joshi. Transformers are graph neural networks. The Gradient, 2020.
  • Jozefowicz et al. (2015) Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015.
  • Kaiser and Sutskever (2015) Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv:1511.08228, 2015.
  • Kalchbrenner et al. (2016) Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv:1610.10099, 2016.
  • Kalchbrenner et al. (2018) Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In ICML, 2018.
  • Kanatani (2012) Ken-Ichi Kanatani. Group-theoretical methods in image understanding. Springer, 2012.
  • Karni and Gotsman (2000) Zachi Karni and Craig Gotsman. Spectral compression of mesh geometry. In Proc. Computer Graphics and Interactive Techniques, 2000.
  • Kazi et al. (2020) Anees Kazi, Luca Cosmo, Nassir Navab, and Michael Bronstein. Differentiable graph module (DGM) for graph convolutional networks. arXiv:2002.04999, 2020.
  • Kenlay et al. (2021) Henry Kenlay, Dorina Thanou, and Xiaowen Dong. Interpretable stability bounds for spectral graph filters. arXiv:2102.09587, 2021.
  • Kimmel and Sethian (1998) Ron Kimmel and James A Sethian. Computing geodesic paths on manifolds. PNAS, 95(15):8431–8435, 1998.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
  • Kipf et al. (2018) Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In ICML, 2018.
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016a.
  • Kipf and Welling (2016b) Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv:1611.07308, 2016b.
  • Kireev (1995) Dmitry B Kireev. ChemNet: a novel neural network based method for graph/property mapping. J. Chemical Information and Computer Sciences, 35(2):175–180, 1995.
  • Klicpera et al. (2020) Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. arXiv:2003.03123, 2020.
  • Kokkinos et al. (2012) Iasonas Kokkinos, Michael M Bronstein, Roee Litman, and Alex M Bronstein. Intrinsic shape context descriptors for deformable shapes. In CVPR, 2012.
  • Komiske et al. (2019) Patrick T Komiske, Eric M Metodiev, and Jesse Thaler. Energy flow networks: deep sets for particle jets. Journal of High Energy Physics, 2019(1):121, 2019.
  • Kostrikov et al. (2018) Ilya Kostrikov, Zhongshi Jiang, Daniele Panozzo, Denis Zorin, and Joan Bruna. Surface networks. In CVPR, 2018.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • Ktena et al. (2017) Sofia Ira Ktena, Sarah Parisot, Enzo Ferrante, Martin Rajchl, Matthew Lee, Ben Glocker, and Daniel Rueckert. Distance metric learning using graph convolutional networks: Application to functional brain networks. In MICCAI, 2017.
  • Kulon et al. (2020) Dominik Kulon, Riza Alp Guler, Iasonas Kokkinos, Michael M Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In CVPR, 2020.
  • Kurach et al. (2015) Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. arXiv:1511.06392, 2015.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
  • Lenz (1990) Reiner Lenz. Group theoretical methods in image processing. Springer, 1990.
  • Leshno et al. (1993) Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
  • Levie et al. (2018) Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Processing, 67(1):97–109, 2018.
  • Levie et al. (2019) Ron Levie, Elvin Isufi, and Gitta Kutyniok. On the transferability of spectral graph filters. In Sampling Theory and Applications, 2019.
  • Lévy (2006) Bruno Lévy. Laplace-Beltrami eigenfunctions towards an algorithm that “understands” geometry. In Proc. Shape Modeling and Applications, 2006.
  • Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv:1511.05493, 2015.
  • Litany et al. (2018) Or Litany, Alex Bronstein, Michael Bronstein, and Ameesh Makadia. Deformable shape completion with graph convolutional autoencoders. In CVPR, 2018.
  • Litman and Bronstein (2013) Roee Litman and Alexander M Bronstein. Learning spectral descriptors for deformable shape correspondence. Trans. PAMI, 36(1):171–180, 2013.
  • Liu et al. (2017) Hsueh-Ti Derek Liu, Alec Jacobson, and Keenan Crane. A Dirac operator for extrinsic shape analysis. Computer Graphics Forum, 36(5):139–149, 2017.
  • Lyu and Simoncelli (2008) Siwei Lyu and Eero P Simoncelli. Nonlinear image representation using divisive normalization. In CVPR, 2008.
  • MacNeal (1949) Richard H MacNeal. The solution of partial differential equations by means of electrical networks. PhD thesis, California Institute of Technology, 1949.
  • Madsen and Johansen (2020) Andreas Madsen and Alexander Rosenberg Johansen. Neural arithmetic units. arXiv:2001.05016, 2020.
  • Mahdi et al. (2020) Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Shunwang Gong, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, and Peter Claes. 3D facial matching by spiral convolutional metric learning and a biometric fusion-net of demographic properties. arXiv:2009.04746, 2020.
  • Maiorov (1999) VE Maiorov. On best approximation by ridge functions. Journal of Approximation Theory, 99(1):68–94, 1999.
  • Makadia et al. (2007) Ameesh Makadia, Christopher Geyer, and Kostas Daniilidis. Correspondence-free structure from motion. IJCV, 75(3):311–327, 2007.
  • Mallat (1999) Stéphane Mallat. A wavelet tour of signal processing. Elsevier, 1999.
  • Mallat (2012) Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
  • Malone et al. (2018) Brandon Malone, Alberto Garcia-Duran, and Mathias Niepert. Learning representations of missing data for predicting patient outcomes. arXiv:1811.04752, 2018.
  • Maron et al. (2018) Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. arXiv:1812.09902, 2018.
  • Maron et al. (2019) Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. arXiv:1905.11136, 2019.
  • Marquis (2009) Jean-Pierre Marquis. Category theory and Klein's Erlangen program. In From a Geometrical Point of View, pages 9–40. Springer, 2009.
  • Masci et al. (2015) Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In CVPR Workshops, 2015.
  • Maxwell (1865) James Clerk Maxwell. A dynamical theory of the electromagnetic field. Philosophical Transactions of the Royal Society of London, (155):459–512, 1865.
  • McEwen et al. (2021) Jason D McEwen, Christopher GR Wallis, and Augustine N Mavor-Parker. Scattering networks on the sphere for scalable and rotationally equivariant spherical CNNs. arXiv:2102.02828, 2021.
  • Mei et al. (2021) Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random features and kernel models. arXiv:2102.13219, 2021.
  • Melzi et al. (2019) Simone Melzi, Riccardo Spezialetti, Federico Tombari, Michael M Bronstein, Luigi Di Stefano, and Emanuele Rodolà. GFrames: Gradient-based local reference frame for 3D shape matching. In CVPR, 2019.
  • Mémoli and Sapiro (2005) Facundo Mémoli and Guillermo Sapiro. A theoretical and computational framework for isometry invariant recognition of point cloud data. Foundations of Computational Mathematics, 5(3):313–347, 2005.
  • Merkwirth and Lengauer (2005) Christian Merkwirth and Thomas Lengauer. Automatic generation of complementary descriptors with molecular graph networks. J. Chemical Information and Modeling, 45(5):1159–1168, 2005.
  • Meyer et al. (2003) Mark Meyer, Mathieu Desbrun, Peter Schröder, and Alan H Barr. Discrete differential-geometry operators for triangulated 2-manifolds. In Visualization and Mathematics III, pages 35–57. 2003.
  • Micheli (2009) Alessio Micheli. Neural network for graphs: A contextual constructive approach. IEEE Trans. Neural Networks, 20(3):498–511, 2009.
  • Miller et al. (2016) Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub, Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al. Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nature Neuroscience, 19(11):1523–1536, 2016.
  • Minsky and Papert (2017) Marvin Minsky and Seymour A Papert. Perceptrons: An introduction to computational geometry. MIT Press, 2017.
  • Mitrovic et al. (2020) Jovana Mitrovic, Brian McWilliams, Jacob Walker, Lars Buesing, and Charles Blundell. Representation learning via invariant causal mechanisms. arXiv:2010.07922, 2020.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • Monti et al. (2017) Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 2017.
  • Monti et al. (2019) Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, and Michael M Bronstein. Fake news detection on social media using geometric deep learning. arXiv:1902.06673, 2019.
  • Morris et al. (2017) Christopher Morris, Kristian Kersting, and Petra Mutzel. Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In ICDM, 2017.
  • Morris et al. (2019) Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, 2019.
  • Morris et al. (2020) Christopher Morris, Gaurav Rattan, and Petra Mutzel. Weisfeiler and Leman go sparse: Towards scalable higher-order graph embeddings. In NeurIPS, 2020.
  • Mozer (1989) Michael C Mozer. A focused back-propagation algorithm for temporal pattern recognition. Complex Systems, 3(4):349–381, 1989.
  • Murphy et al. (2013) Kevin Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: An empirical study. arXiv:1301.6725, 2013.
  • Murphy et al. (2019) Ryan Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational pooling for graph representations. In ICML, 2019.
  • Murphy et al. (2018) Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. arXiv:1811.01900, 2018.
  • Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • Nash (1956) John Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, 63(1):20–63, 1956.
  • Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In COLT, 2015.
  • Noether (1918) Emmy Noether. Invariante Variationsprobleme. In König. Gesellsch. d. Wiss. zu Göttingen, Math-Phys. Klasse, pages 235–257. 1918.
  • Ovsjanikov et al. (2008) Maks Ovsjanikov, Jian Sun, and Leonidas Guibas. Global intrinsic symmetries of shapes. Computer Graphics Forum, 27(5):1341–1348, 2008.
  • Ovsjanikov et al. (2012) Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. Functional maps: a flexible representation of maps between shapes. ACM Trans. Graphics, 31(4):1–11, 2012.
  • Pal et al. (2020) Aditya Pal, Chantat Eksombatchai, Yitong Zhou, Bo Zhao, Charles Rosenberg, and Jure Leskovec. PinnerSage: Multi-modal user embedding framework for recommendations at Pinterest. In KDD, 2020.
  • Parisot et al. (2018) Sarah Parisot, Sofia Ira Ktena, Enzo Ferrante, Matthew Lee, Ricardo Guerrero, Ben Glocker, and Daniel Rueckert. Disease prediction using graph convolutional networks: application to autism spectrum disorder and Alzheimer's disease. Medical Image Analysis, 48:117–130, 2018.
  • Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
  • Patanè (2020) Giuseppe Patanè. Fourier-based and rational graph filters for spectral processing. arXiv:2011.04055, 2020.
  • Pearl (2014) Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier, 2014.
  • Penrose (2005) Roger Penrose. The road to reality: A complete guide to the laws of the universe. Random House, 2005.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In KDD, 2014.
  • Pfaff et al. (2020) Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W Battaglia. Learning mesh-based simulation with graph networks. arXiv:2010.03409, 2020.
  • Pineda (1988) Fernando J Pineda. Generalization of back propagation to recurrent and higher order neural networks. In NIPS, 1988.
  • Pinkall and Polthier (1993) Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their conjugates. Experimental Mathematics, 2(1):15–36, 1993.
  • Pinkus (1999) Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.
  • Pollard et al. (2018) Tom J Pollard, Alistair EW Johnson, Jesse D Raffa, Leo A Celi, Roger G Mark, and Omar Badawi. The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific Data, 5(1):1–13, 2018.
  • Portilla and Simoncelli (2000) Javier Portilla and Eero P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. IJCV, 40(1):49–70, 2000.
  • Qi et al. (2017) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
  • Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM, 2018.
  • Qu and Gouskos (2019) Huilin Qu and Loukas Gouskos. ParticleNet: Jet tagging via particle clouds. arXiv:1902.08570, 2019.
  • Qu et al. (2019) Meng Qu, Yoshua Bengio, and Jian Tang. GMNN: Graph Markov neural networks. In ICML, 2019.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Ranjan et al. (2018) Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. Generating 3D faces using convolutional mesh autoencoders. In ECCV, 2018.
  • Raviv et al. (2007) Dan Raviv, Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Symmetries of non-rigid shapes. In ICCV, 2007.
  • Razin and Cohen (2020) Noam Razin and Nadav Cohen. Implicit regularization in deep learning may not be explainable by norms. arXiv:2005.06398, 2020.
  • Reed and De Freitas (2015) Scott Reed and Nando De Freitas. Neural programmer-interpreters. arXiv:1511.06279, 2015.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497, 2015.
  • Rezende and Mohamed (2015) Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015.
  • Riesenhuber and Poggio (1999) Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
  • Robinson and Fallside (1987) AJ Robinson and Frank Fallside. The utility driven dynamic error propagation network. University of Cambridge, 1987.
  • Rocheteau et al. (2020) Emma Rocheteau, Pietro Liò, and Stephanie Hyland. Temporal pointwise convolutional networks for length of stay prediction in the intensive care unit. arXiv:2007.09483, 2020.
  • Rocheteau et al. (2021) Emma Rocheteau, Catherine Tong, Petar Veličković, Nicholas Lane, and Pietro Liò. Predicting patient outcomes with graph representation learning. arXiv:2101.03940, 2021.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • Rosenblatt (1958) Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
  • Rossi et al. (2020) Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv:2006.10637, 2020.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • Rustamov et al. (2013) Raif M Rustamov, Maks Ovsjanikov, Omri Azencot, Mirela Ben-Chen, Frédéric Chazal, and Leonidas Guibas. Map-based exploration of intrinsic shape differences and variability. ACM Trans. Graphics, 32(4):1–12, 2013.
  • Salimans and Kingma (2016) Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv:1602.07868, 2016.
  • Sanchez-Gonzalez et al. (2019) Alvaro Sanchez-Gonzalez, Victor Bapst, Kyle Cranmer, and Peter Battaglia. Hamiltonian graph networks with ODE integrators. arXiv:1909.12790, 2019.
  • Sanchez-Gonzalez et al. (2020) Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. Learning to simulate complex physics with graph networks. In ICML, 2020.
  • Sandryhaila and Moura (2013) Aliaksei Sandryhaila and José MF Moura. Discrete signal processing on graphs. IEEE Trans. Signal Processing, 61(7):1644–1656, 2013.
  • Santoro et al. (2017) Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
  • Santoro et al. (2018) Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. arXiv:1806.01822, 2018.
  • Santurkar et al. (2018) Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? arXiv:1805.11604, 2018.
  • Sato et al. (2020) Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks. arXiv:2002.03155, 2020.
  • Satorras et al. (2021) Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. arXiv:2102.09844, 2021.
  • Scaife and Porter (2021) Anna MM Scaife and Fiona Porter. Fanaroff-Riley classification of radio galaxies using group-equivariant convolutional neural networks. Monthly Notices of the Royal Astronomical Society, 2021.
  • Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2008.
  • Schrittwieser et al. (2020) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
  • Schütt et al. (2018) Kristof T Schütt, Huziel E Sauceda, P-J Kindermans, Alexandre Tkatchenko, and K-R Müller. SchNet – a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24):241722, 2018.
  • Sejnowski et al. (1986) Terrence J Sejnowski, Paul K Kienker, and Geoffrey E Hinton. Learning symmetry groups with hidden units: Beyond the perceptron. Physica D: Nonlinear Phenomena, 22(1-3):260–275, 1986.
  • Senior et al. (2020) Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander WR Nelson, Alex Bridgland, et al. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, 2020.
  • Serre et al. (2007) Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15):6424–6429, 2007.
  • Shamir and Vardi (2020) Ohad Shamir and Gal Vardi. Implicit regularization in relu networks with the square loss. arXiv:2012.05156, 2020.
  • Shawe-Taylor (1989) John Shawe-Taylor. Building symmetries into feedforward networks. In ICANN, 1989.
  • Shawe-Taylor (1993) John Shawe-Taylor. Symmetries and discriminability in feedforward network architectures. IEEE Trans. Neural Networks, 4(5):816–826, 1993.
  • Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. JMLR, 12(9), 2011.
  • Shlomi et al. (2020) Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vlimant. Graph neural networks in particle physics. Machine Learning: Science and Technology, 2(2):021001, 2020.
  • Shuman et al. (2013) David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
  • Siegelmann and Sontag (1995) Hava T Siegelmann and Eduardo D Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Simoncelli and Freeman (1995) Eero P Simoncelli and William T Freeman. The steerable pyramid: A flexible architecture for multi-scale derivative computation. In International Conference on Image Processing, volume 3, pages 444–447. IEEE, 1995.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • Smola et al. (2007) Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In ALT, 2007.
  • Spalević et al. (2020) Stefan Spalević, Petar Veličković, Jovana Kovačević, and Mladen Nikolić. Hierarchical protein function prediction with tail-GNNs. arXiv:2007.12804, 2020.
  • Sperduti (1994) Alessandro Sperduti. Encoding labeled graphs by labeling RAAM. In NIPS, 1994.
  • Sperduti and Starita (1997) Alessandro Sperduti and Antonina Starita. Supervised neural networks for the classification of structures. IEEE Trans. Neural Networks, 8(3):714–735, 1997.
  • Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv:1412.6806, 2014.
  • Srinivasan and Ribeiro (2019) Balasubramaniam Srinivasan and Bruno Ribeiro. On the equivalence between positional node embeddings and structural graph representations. arXiv:1910.00452, 2019.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
  • Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
  • Stachenfeld et al. (2020) Kimberly Stachenfeld, Jonathan Godwin, and Peter Battaglia. Graph networks with spectral message passing. arXiv:2101.00079, 2020.
  • Stokes et al. (2020) Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M Donghia, Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar Bloom-Ackerman, et al. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702, 2020.
  • Strathmann et al. (2021) Heiko Strathmann, Mohammadamin Barekatain, Charles Blundell, and Petar Veličković. Persistent message passing. arXiv:2103.01043, 2021.
  • Straumann (1996) Norbert Straumann. Early history of gauge theories and weak interactions. hep-ph/9609230, 1996.
  • Sun et al. (2009) Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. Computer Graphics Forum, 28(5):1383–1392, 2009.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. arXiv:1409.3215, 2014.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • Tallec and Ollivier (2018) Corentin Tallec and Yann Ollivier. Can recurrent neural networks warp time? arXiv:1804.11188, 2018.
  • Tang et al. (2020) Hao Tang, Zhiao Huang, Jiayuan Gu, Bao-Liang Lu, and Hao Su. Towards scale-invariant graph-related problem solving by iterative homogeneous GNNs. In NeurIPS, 2020.
  • Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW, 2015.
  • Taubin et al. (1996) Gabriel Taubin, Tong Zhang, and Gene Golub. Optimal surface smoothing as filter design. In ECCV, 1996.
  • Thakoor et al. (2021) Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Rémi Munos, Petar Veličković, and Michal Valko. Bootstrapped representation learning on graphs. arXiv:2102.06514, 2021.
  • Thomas et al. (2018) Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv:1802.08219, 2018.
  • Tobies (2019) Renate Tobies. Felix Klein—mathematician, academic organizer, educational reformer. In The Legacy of Felix Klein, pages 5–21. Springer, 2019.
  • Trask et al. (2018) Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. arXiv:1808.00508, 2018.
  • Tromp and Farnebäck (2006) John Tromp and Gunnar Farnebäck. Combinatorics of Go. In International Conference on Computers and Games, 2006.
  • Tsybakov (2008) Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer, 2008.
  • Ulyanov et al. (2016) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
  • van den Oord et al. (2016a) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016a.
  • van den Oord et al. (2016b) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016b.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In ICLR, 2018.
  • Veličković et al. (2019) Petar Veličković, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. Neural execution of graph algorithms. arXiv:1910.10593, 2019.
  • Veličković et al. (2020) Petar Veličković, Lars Buesing, Matthew C Overlan, Razvan Pascanu, Oriol Vinyals, and Charles Blundell. Pointer graph networks. arXiv:2006.06380, 2020.
  • Veličković et al. (2019) Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep Graph Infomax. In ICLR, 2019.
  • Veselkov et al. (2019) Kirill Veselkov, Guadalupe Gonzalez, Shahad Aljifri, Dieter Galea, Reza Mirnezami, Jozef Youssef, Michael Bronstein, and Ivan Laponogov. Hyperfoods: Machine intelligent mapping of cancer-beating molecules in foods. Scientific Reports, 9(1):1–12, 2019.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. arXiv:1506.03134, 2015.
  • Vinyals et al. (2016) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In ICLR, 2016.
  • von Luxburg and Bousquet (2004) Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. JMLR, 5:669–695, 2004.
  • Wainwright and Jordan (2008) Martin J Wainwright and Michael Irwin Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008.
  • Wang and Solomon (2019) Yu Wang and Justin Solomon. Intrinsic and extrinsic operators for shape analysis. In Handbook of Numerical Analysis, volume 20, pages 41–115. Elsevier, 2019.
  • Wang et al. (2018) Yu Wang, Mirela Ben-Chen, Iosif Polterovich, and Justin Solomon. Steklov spectral geometry for extrinsic shape analysis. ACM Trans. Graphics, 38(1):1–21, 2018.
  • Wang et al. (2019a) Yu Wang, Vladimir Kim, Michael Bronstein, and Justin Solomon. Learning geometric operators on meshes. In ICLR Workshops, 2019a.
  • Wang et al. (2019b) Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Trans. Graphics, 38(5):1–12, 2019b.
  • Wardetzky (2008) Max Wardetzky. Convergence of the cotangent formula: An overview. Discrete Differential Geometry, pages 275–286, 2008.
  • Wardetzky et al. (2007) Max Wardetzky, Saurabh Mathur, Felix Kälberer, and Eitan Grinspun. Discrete Laplace operators: no free lunch. In Symposium on Geometry Processing, 2007.
  • Weiler et al. (2018) Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. arXiv:1807.02547, 2018.
  • Weisfeiler and Leman (1968) Boris Weisfeiler and Andrei Leman. The reduction of a graph to canonical form and the algebra which appears therein. NTI Series, 2(9):12–16, 1968.
  • Werbos (1988) Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
  • Weyl (1929) Hermann Weyl. Elektron und Gravitation. I. Zeitschrift für Physik, 56(5-6):330–352, 1929.
  • Weyl (2015) Hermann Weyl. Symmetry. Princeton University Press, 2015.
  • Winkels and Cohen (2019) Marysia Winkels and Taco S Cohen. Pulmonary nodule detection in CT scans with equivariant CNNs. Medical Image Analysis, 55:15–26, 2019.
  • Wood and Shawe-Taylor (1996) Jeffrey Wood and John Shawe-Taylor. Representation theory and invariant neural networks. Discrete Applied Mathematics, 69(1-2):33–60, 1996.
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In ICML, 2019.
  • Wu and He (2018) Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
  • Xu et al. (2020a) Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Inductive representation learning on temporal graphs. arXiv:2002.07962, 2020a.
  • Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv:1810.00826, 2018.
  • Xu et al. (2019) Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? arXiv:1905.13211, 2019.
  • Xu et al. (2020b) Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. arXiv:2009.11848, 2020b.
  • Yan et al. (2020) Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi. Neural execution engines: Learning to execute subroutines. arXiv:2006.08084, 2020.
  • Yang and Mills (1954) Chen-Ning Yang and Robert L Mills. Conservation of isotopic spin and isotopic gauge invariance. Physical Review, 96(1):191, 1954.
  • Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.
  • Yedidia et al. (2001) Jonathan S Yedidia, William T Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. NIPS, 2001.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018.
  • You et al. (2019) Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In ICML, 2019.
  • Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In NIPS, 2017.
  • Zaremba and Sutskever (2014) Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv:1410.4615, 2014.
  • Zeng et al. (2012) Wei Zeng, Ren Guo, Feng Luo, and Xianfeng Gu. Discrete heat kernel determines discrete Riemannian metric. Graphical Models, 74(4):121–129, 2012.
  • Zhang et al. (2018) Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv:1803.07294, 2018.
  • Zhang et al. (2020) Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, and Le Song. Efficient probabilistic logic reasoning with graph neural networks. arXiv:2001.11850, 2020.
  • Zhu et al. (2019) Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. Aligraph: A comprehensive graph neural network platform. arXiv:1902.08730, 2019.
  • Zhu and Razavian (2019) Weicheng Zhu and Narges Razavian. Variationally regularized graph-based representation learning for electronic health records. arXiv:1912.03761, 2019.
  • Zhu et al. (2020) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv:2006.04131, 2020.
  • Zitnik et al. (2018) Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457–i466, 2018.